Data visualisation is a very important step in machine learning, but I often ignored it. Recently I attended a session on data visualisation organised by WiMLDS (Women in Machine Learning and Data Science), and that session taught me why it matters.
Data visualisation is a method of representing our data in pictorial form. It is not just there to make a project look attractive; it is a very effective way of understanding our data. As machine learning beginners, we often make the mistake of jumping directly to creating models and writing code, and we neglect data visualisation.
Data visualisation should be one of the first steps of any machine learning project. We must first visualise our data and understand it in depth in order to decide which model is appropriate. This process of analysing data is known as Exploratory Data Analysis (EDA).
Through EDA we get the following information about our data:
  • Distributions
  • Data quality problems
  • Duplicates
  • Incomplete data
  • Too much data (for example, too many redundant features), which can lead to overfitting
  • Inconsistent data
  • Poor organisation of data
  • Incorrect data
  • Poorly defined data
  • Poor data security
  • Outliers
  • Correlation and interrelationships
  • Functional relationships
  • Derived attributes and keys (primary key or foreign key)
  • Static and dynamic attributes
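A few of the checks listed above can be sketched in Python with pandas. This is only a minimal sketch: the toy dataset, its column names (`age`, `income`) and the helper `eda_summary` are made up for illustration, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import pandas as pd

# Hypothetical toy dataset; the columns and values are invented for illustration.
df = pd.DataFrame({
    "age":    [23, 25, 25, 31, 38, None, 29, 95],
    "income": [30, 35, 35, 48, 60, 41, 44, 52],
})


def eda_summary(frame):
    """Collect a few basic data-quality numbers for a DataFrame."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values lying more than 1.5 * IQR outside the quartiles (assumed rule).
    is_outlier = numeric.lt(q1 - 1.5 * iqr) | numeric.gt(q3 + 1.5 * iqr)
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
        "outliers_per_column": is_outlier.sum().to_dict(),
        "correlation": numeric.corr(),
    }


summary = eda_summary(df)
print(summary["duplicate_rows"])
print(summary["missing_per_column"])
print(summary["outliers_per_column"])
print(summary["correlation"])
```

Numbers like these tell us where to look; plotting the same columns afterwards usually makes the duplicates, gaps and outliers much easier to see.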
Our data may have many features (columns), which can make it difficult to select the appropriate ones. Sampling makes this easier: we take different parts or features of the data and visualise them in order to understand each feature in depth. In this way we can choose the most useful features for our analysis.
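As a rough sketch of this idea, we can pick a few columns, draw a random sample of rows, and plot one histogram per selected feature. The dataset, the column names and the choice of histograms are all assumptions made for illustration; the session did not prescribe a specific method.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataset; the columns and values are invented for illustration.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 172, 175, 180, 190],
    "weight_kg": [50, 55, 62, 68, 70, 74, 82, 95],
    "shoe_size": [36, 37, 39, 41, 41, 42, 44, 46],
})

# Take a subset of features and a random sample of rows.
selected = ["height_cm", "weight_kg"]
rows = df[selected].sample(n=5, random_state=0)

# One histogram per selected feature, side by side.
fig, axes = plt.subplots(1, len(selected), figsize=(8, 3))
for ax, col in zip(axes, selected):
    ax.hist(rows[col], bins=4)
    ax.set_title(col)
fig.tight_layout()
```

To actually look at the figure, we can call `fig.savefig("features.png")` (the file name is arbitrary) or use `plt.show()` in an interactive session.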
We can use many different types of graphs to visualise our data:
  • Bar graph
  • Pie chart
  • Histogram
  • Bubble chart
  • Scatter graph
  • Network visualisation
  • Heatmap
  • Geodata visualisation
  • Violin plot
  • Box plot
  • Swarm plot
  • Count plot
We need to choose the appropriate graph that represents our data in the most effective way.
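To make the choice a bit more concrete, here is a small sketch that draws four of the graphs from the list with Matplotlib, each for the kind of question it usually answers. The dataset and its columns (`hours`, `score`, `group`) are invented, and the pairing of chart to question is a common rule of thumb rather than a fixed rule.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataset; the columns and values are invented for illustration.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [35, 42, 50, 55, 61, 70, 74, 90],
    "group": ["A", "B", "A", "C", "B", "A", "C", "A"],
})

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Histogram: how is one numeric feature distributed?
axes[0, 0].hist(df["score"], bins=5)
axes[0, 0].set_title("Histogram of score")

# Box plot: where are the median, the spread and possible outliers?
axes[0, 1].boxplot(df["score"])
axes[0, 1].set_title("Box plot of score")

# Scatter plot: how are two numeric features related?
axes[1, 0].scatter(df["hours"], df["score"])
axes[1, 0].set_title("hours vs score")

# Bar graph: how many rows fall in each category?
counts = df["group"].value_counts()
axes[1, 1].bar(list(counts.index), list(counts.values))
axes[1, 1].set_title("Rows per group")

fig.tight_layout()
```

Seaborn offers similar plots, as well as the violin, swarm and count plots from the list, with less code, but plain Matplotlib keeps this sketch down to one plotting library.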
Some tools and libraries can help us visualize our data very easily.
Tools for Data Visualisation
There are various libraries available in Python and R:
Python libraries
  • Plotly
  • Matplotlib
  • Seaborn
  • Ggplot
  • Bokeh
  • Pygal
  • Altair
  • Geoplotlib for heatmaps
R libraries
  • Plotly
  • ggplot2
  • Shiny
KNIME
KNIME is open-source software for visualising and analysing data. It helps us design the workflow. It is often preferred over Python libraries because it provides one environment for all visualisation processes, and it is very easy to use thanks to its drag-and-drop interface.
Google Data Studio

Google Data Studio provides templates and helps us generate detailed reports with charts and other visualisations.

D3.js

D3.js is a great JavaScript-based tool that can be used to visualise our data using different shapes and animations. It helps us express our data in an interesting, story-like format.

Now that we have learnt to analyse our data, we need to learn how to find an appropriate model for it. I think we will get to learn that in the next workshop by WiMLDS on the topic “Hyperparameter tuning”, which will be held on 25th Jan at SICSR, Pune.

I learnt so many things in this workshop by WiMLDS. WiMLDS is a great organisation to get associated with and learn machine learning. WiMLDS has also recently collaborated with WiDS by Stanford University, and they are organising an amazing conference on machine learning. I will be talking in detail about these clubs and the awesome conference in my next few blogs. So stay tuned.

You can check the following websites to know more about these organisations:

https://www.widsconference.org/

http://wimlds.org/