Rosaria Silipo, KNIME
Hi, Rosaria, we’re glad you agreed to be our interview guest. The data suggest women are still underrepresented in the field, so it would be very interesting to hear more about your data science professional experience. But let’s find out more about you first. Could you briefly introduce yourself?
Hi, I’m Rosaria Silipo, and I currently work as a principal data scientist at KNIME. I earned my doctorate in biomedical engineering in 1996. I have been working in the field of data analytics since my master’s thesis, which adds up to some 25 years (I started my master’s thesis at the end of 1990). I’ve been through the many hypes (and names) of data analytics, and I’ve worked with many different tools, including code I wrote myself.
I think the description on my Twitter profile summarizes my journey across time in the data science field well: “I loved data before it was big, and I loved learning before it was deep.”
As a woman in the field, have you ever felt that there's a difference between your professional opportunities and those of your male colleagues? Do you think there’s prejudice against women when it comes to finding new clients? Or do you believe the tide is finally turning for women who work or wish to work as data scientists?
This question is always very difficult to answer. I have worked and still work in collaboration with a number of people from different countries and different cultures. Of course, there have been different attitudes toward me and my work over the years, as there have been and still are different attitudes toward the whole data science field. Sometimes the mistrust has come from people I would have never expected. Other times, though, collaboration with colleagues and customers with questionable stereotypical reputations has been glitch-free.
On average, I can say that the situation is very different now in comparison to 25-30 years ago. Women data scientists are definitely perceived to be much more competent and trustworthy now. Has the tide turned? I cannot say. I only have one sample data point (myself) which can hardly be generalized to the thousands or even millions of (often younger) women data scientists in the whole world.
That's understandable. Well, we sure hope the tide is turning, because women have a lot to contribute to the field. Now, let’s take you on a journey to the past. Do you remember how you started working with data?
I recall it started with my master’s thesis. Back then, I wanted to use machine learning algorithms (at that time mainly neural networks trained with backpropagation) to automatically classify heartbeats in an electrocardiographic signal. The problem at first seemed relatively easy: just two classes, normal beats and arrhythmic beats.
However, when I dug deeper, I discovered that arrhythmic beats could come in many forms and shapes, depending on the pathology. Thus, the binomial classification problem quickly became a multinomial classification problem.
Furthermore, in some ECG tracks, I also had to deal with sudden, unknown heartbeat shapes that were unclassified and unrepresented.
So, this is how I got introduced to anomaly detection problems. In addition, some pathologies did not produce a dramatically different heartbeat shape but just a slight drift of the signal over time. This was the realm of time series analysis. And so on … as one problem was somehow solved, a new problem arose. And I haven’t stopped digging deeper since then.
Yes, data science is a lifelong learning process. As an experienced data scientist, you probably have quite a few victories. Could you share with our readers which highlights in your career you are most proud of so far?
Sure. It’s true I’ve worked on some successful data science projects over the years. To tell you the truth, I’m quite proud of all of them. However, my biggest achievements are actually byproducts of all the data science projects I have worked on. One of them is the KNIME e-learning course on data science.
That sounds like a massive project. What inspired you to embark on it?
In my experience, when running projects and working with junior data scientists, the most frequent types of questions are about the algorithms and the process. So, a few years ago, I started the basics for an e-learning course on data science. This resulted in almost 100 short videos explaining how to access data, how to train a machine learning model, which parameters to tweak, etc.
Did you encounter any challenges in the process?
There have been a number of technical challenges setting up this e-learning course, including the choice of an English voice without much of an accent. Nevertheless, the result is an almost complete basic course to introduce newbies to implementing a data science project and to some of the math behind the algorithms. And no, I have not solved my Italian accent challenge yet!
No need to, if you ask us. You’re also the author of a book on data science…
Yes. Another achievement I’m quite satisfied with is writing the “Practicing Data Science” e-book.
With all the data science projects I have implemented, it felt like a shame to keep all of this experience just to myself! So, in 2018, I started collecting the most popular use cases and their solutions in an e-book with the title “Practicing Data Science.” This is a living e-book in the sense that every time a new project is completed, a new use case and a new chapter for the book is born.
That’s why it currently contains 22 use cases on various data domains and business cases, such as customer intelligence, cybersecurity, IoT, social media, biology, web search, and more. The e-book is sold on our website, but I would like to offer your readers a promotional code for a free download: 365-DATA-SCIENCE-0419.
Those are some serious achievements. How did you manage to do all of this by yourself?
Actually, my third biggest accomplishment is assembling a team of very talented data scientists, graphic experts, and writers: the evangelism group at KNIME. All those project solutions and other achievements were possible thanks to the collaboration among KNIME evangelists, drawing on their many different technical competences. They’ve helped me with the development of data science resources; the descriptions of general use cases; e-books, videos, whitepapers, and courses. I could not have done all of this by myself!
So, we could say that behind every great data scientist, there’s a tribe of other great data scientists.
Rosaria, you are now a principal data scientist at KNIME.com. What do you like most about your job? Are there any day-to-day operations or stages in a project you prefer to others?
I seriously adore my job. No kidding. Of course, the pressure and the deadlines are stressful to handle, like in all jobs. However, the constant learning, the feeling of self-fulfillment when a project comes to an end and the solution is actually working bring me great satisfaction. Plus, the pride when you see the surprise in the customers’ eyes at the results, and the joy of teaching others about the techniques are priceless. It’s a bit like working with magic. However, in my opinion, the best part of my job is when I start a new project.
What makes it the best part?
Well, in this exploratory phase, all options are still open. You are allowed to dream big. Reality constraints will set in only later during project development. It is the creative phase (who said that data science is not a creative discipline?). I guess this must be similar to the feeling a writer gets in front of a blank page still to be written.
We believe data science and creativity go hand in hand, too. Could you share with our readers the most surprising insight you've gained in your experience as a data scientist?
Sure. I’ve worked on several very interesting projects. The most successful one in terms of customer acceptance was about anomaly detection. How to detect anomalies when no anomaly examples are in the data set? The solution was based on predicting samples for the system working in normal conditions and calculating the distance between prediction and reality. It gave us hints as to when and how the underlying system starts to deteriorate. Recently, I have also trained a deep learning network based on LSTM units to generate rap songs, most likely the first AI-powered rap songs. That was a fun project!
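As a rough illustration of the prediction-based idea described above (not the actual project code), the sketch below fits an alarm threshold on data from normal operation only and then flags samples whose distance from the prediction exceeds it. The moving-average predictor, the "mean plus three standard deviations" threshold, the synthetic sine signal, and all function names are assumptions made for this example; the interview does not say which model or threshold was used.

```python
import math
import statistics


def predict_next(history, window=5):
    """Predict the next sample as the mean of the last `window` samples.

    Assumption: a moving average stands in for whatever model was
    trained on normal-condition data in the real project.
    """
    recent = history[-window:]
    return sum(recent) / len(recent)


def prediction_errors(signal, window=5):
    """Absolute distance between each sample and its prediction."""
    return [
        abs(signal[i] - predict_next(signal[:i], window))
        for i in range(window, len(signal))
    ]


def fit_threshold(normal_signal, window=5, k=3.0):
    """Derive an alarm threshold from normal data only (no anomaly examples).

    Assumption: mean error plus k standard deviations is a common
    heuristic, chosen here purely for illustration.
    """
    errors = prediction_errors(normal_signal, window)
    return statistics.mean(errors) + k * statistics.pstdev(errors)


def detect_anomalies(signal, threshold, window=5):
    """Return the indices whose prediction error exceeds the threshold."""
    errors = prediction_errors(signal, window)
    return [i + window for i, err in enumerate(errors) if err > threshold]


# Synthetic "normal operation" signal used to fit the threshold.
normal = [math.sin(0.1 * t) for t in range(200)]
threshold = fit_threshold(normal)

# Monitored signal: same behavior, with one injected fault at index 120.
monitored = [math.sin(0.1 * t) for t in range(200)]
monitored[120] += 2.0

flagged = detect_anomalies(monitored, threshold)
```

Because the threshold comes only from normal data, no labeled anomalies are needed. Tracking how the error grows over time, rather than only whether it crosses the threshold, is one way to see when a system starts to deteriorate.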
This is quite the innovation! Curious to see if an AI-generated rap song will make it to the charts.
Rosaria, in your opinion, which are the top three common challenges most beginner data scientists must face and overcome in their day-to-day professional life? What has been the biggest challenge in your career so far?
In my experience, the biggest challenges when starting as a junior data scientist are attitude related. Math and algorithms can be studied and learned. Attitude is harder to acknowledge and to change. I’ve worked with many junior data scientists so far. They’ve just completed a university degree or a specialization course in data science, and they think they are done and have nothing more to learn.
So, you could say the challenge is in the expectations they have?
Yes, the first big challenge is in acknowledging that this job requires (and will keep requiring) continuous learning. There is no data scientist who knows it all. Or at least I have never met any. We all specialize in some specific techniques, data domains, or business cases. And even in what we know best, new optimization techniques, new loss functions, and new approaches often appear, and we need to learn them again. The university courses and the specialization courses all give us the capability of learning new techniques and applications quickly, but a big part of our job consists of continuous learning.
Another necessary change in attitude is about task organization.
A junior data scientist often thinks that their job is just to train machine learning models, which “automagically” generate fantastic insights and leave the customers in awe. Well, it is not that simple. Actually, training and applying one or more machine learning models is the easiest part. There are libraries in any tool exactly to do that. One node or one line of code will probably do the trick. The main challenge in a project comes before and after you train the model. Before that, you need to prepare the data so that they are clean and describe the problem accurately and informatively, making the model training much easier. After you train the model, you need to optimize it, as well as interpret and communicate the results.
All of these tasks are an integral part of a successful data science project and part of the data scientist’s job. In addition, the data scientist can help to clean the data, for example, not only manually but also by creating and proposing more efficient automatic AI-based solutions. I insist on that because bad data produce bad results, no matter how smart the machine learning algorithm. So, cleaning data or presenting results is also an important part of the data scientist’s job. Finally, no AI-generated insight is as admirable if it can't be communicated properly.
I know most of us come from a science or computer programming background and might not be well versed in words. However, communication of the final results is as important as the solution itself. Many data science solutions fail during the deployment phase, and one of the most common causes is the inability of the data scientists to effectively communicate the power of the achieved results (see blog post: “The Deployment Pain”). One of the skills junior data scientists often lack is communication, both in speaking and writing, and they will need to learn it and master it if they want their data science solutions to be successful.
With that said, I am no exception. I had to go through all of these challenges in earlier years when I was a junior graduate myself, as well as in more recent times as a principal data scientist.
Those were some profound observations that we’re sure our readers will benefit from. On a different note, how do you see data science evolving over the next 5-10 years? There are many aspiring data scientists who would definitely appreciate hearing your opinion on this.
Well, at the moment, we are witnessing the name of the data analytics field changing from data science to artificial intelligence. I see a gap and the need for more creative usage of machine learning algorithms.
Traditionally, a data science project included data preparation, training of one or more models, evaluation of these models, and finally deployment.
The goal was the prediction of a number in the future or of the class of the current data sample. It is quite a standard cyclical process.
Recently, I’ve seen solutions for new, more creative problems, such as free text generation. For example, a few deep learning models have been designed to generate language in the style of Shakespeare or rap songs, depending on the type of the training set.
Similarly, a few neural network architectures have been designed to learn and produce masterpiece-like paintings. The final goal would be to empower AI to produce more human-like skills. I expect this trend to continue over the next few years.
More technically speaking, I believe that semi-supervised learning techniques, such as active learning, will become more and more important.
Companies have collected tons of data over the past few years, which they would like to use now to get additional insights. Most of these data, however, are unlabeled, and sometimes an ontology to label them is not even available. Currently, the only way to analyze them consists of unsupervised learning techniques, such as clustering. Semi-supervised learning techniques could provide labeling options for the further application of supervised algorithms.
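To make the labeling idea concrete, here is a minimal sketch of active learning with uncertainty sampling: starting from a couple of labeled points, the loop repeatedly asks for the label of the unlabeled point the current model is least sure about. The one-dimensional nearest-centroid model, the `oracle` function standing in for a human annotator, and the toy data are all hypothetical choices made for this example, not a method described in the interview.

```python
def centroids(labeled):
    """Mean feature value per class, from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def margin(x, cents):
    """Gap between the distances to the two closest class centroids.

    A small margin means the model is uncertain about x.
    Requires at least two classes in the labeled set.
    """
    distances = sorted(abs(x - c) for c in cents.values())
    return distances[1] - distances[0]


def active_learning(labeled, pool, oracle, budget):
    """Label up to `budget` pool points, most uncertain first."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(min(budget, len(pool))):
        cents = centroids(labeled)
        query = min(pool, key=lambda x: margin(x, cents))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # ask the annotator
    return labeled


def oracle(x):
    """Hypothetical human annotator: the 'true' labeling rule."""
    return "high" if x >= 5.0 else "low"


seed = [(0.0, "low"), (10.0, "high")]     # the few labels we start with
pool = [1.0, 2.0, 4.5, 5.5, 8.0, 9.0]     # unlabeled data
result = active_learning(seed, pool, oracle, budget=2)
```

With a budget of two queries, the loop spends its labeling effort on the points closest to the class boundary instead of on easy cases, which is the main appeal when labels are expensive. The resulting labeled set can then feed a supervised algorithm.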
Reflecting further on the technical aspects, how important exactly is programming for an aspiring data scientist in your opinion?
Nowadays, programming knowledge is not necessary to become a competent data scientist. In the quite remote past, programming knowledge was necessary to clean and transform your data and to train your own machine learning algorithm. C, C++, or Java were commonly used to develop your own data analytics solution.
Nowadays, a few more comfortable options are available out there to build even better solutions. Some, like R and Python, are script based. However, some others, like KNIME Analytics Platform, rely on visual programming via a graphical user interface (GUI).
Therefore, using a GUI-based AI software tool reduces the need for programming and scripting.
That’s certainly good news for anyone without an IT background willing to start a data science career. In conclusion, would you like to share some words of wisdom for data science students or practitioners starting out?
Learn the math behind the algorithms, not just how to apply them in a script.
Throughout your career, you will need to learn new tools. However, learning a new tool will be easier if you have a solid grasp of the math behind your data processing techniques.
Acknowledge that in this field you will never stop learning. It is both the good and the bad side of working as a data scientist. On the one hand, your brain will never stop acquiring new concepts; on the other, you’ll need to invest time to acquire them.
Take every new project as a great chance to learn something new, from a book, a colleague, experience, or something else. You never know where the next piece of new knowledge will come from!