With the growth of Big Data and the Internet of Things, data science has become the new buzzword in technology circles. That is no wonder when you consider that 90% of the data we have today was created in just the last two years.
This huge influx of data brings with it growing cyber security problems. Security systems generate massive amounts of data, and it takes a cyber sleuth to work out which data represents cyber attacks, malware, viruses, and other potential threats, and which reflects mundane everyday activity.
Recent multi-pronged heists indicate that cyber attacks are causing more damage than ever. According to Ponemon Institute’s 2013 Cost of Cybercrime Study, the average annualized cost of cybercrime per company was roughly $7.2 million, a 30% increase in the mean value from the previous year.
According to Todd Pedersen, a cybersecurity lead for CSC, “We used to make statements, such as ‘I have a firewall; I’m protected,’ or ‘I have antivirus software; I’m protected.’ Now, the conversation is less about preventing an attack, threat or exposure, and more about how quickly you can detect that an attack is happening.”
Context is the key to telling legitimate attacks from false alarms. Getting that context depends on the volume, variety, and velocity of the data available.
Volume of data
Data scientists need a sufficient volume of data to build accurate machine learning models. If the data set is too small, they are forced to make assumptions.
Variety of data
Variety of data normally refers to the different types of structured, semi-structured, and unstructured data. Structured data could consist of security logs, while unstructured data could consist of emails or social media posts. Without enough variety (e.g. only security logs), models can be built improperly and data misclassified.
Velocity of data
With the amount of data growing tenfold every five years, the speed at which data is generated continues to be a problem. However, thanks to low-cost computing and storage, data scientists can store and analyze the growing amount of data.
With an appropriate volume, variety, and velocity of data, data scientists can put on their cyber sleuth hats and use analytics to separate the real threats from mundane daily activities. As data sets continue to grow with both structured and unstructured data, this will become an increasingly important job for data scientists.
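To make the idea of separating real threats from mundane activity concrete, here is a minimal sketch in Python. It is only an illustration, not a method described by any of the sources above: the hourly failed-login counts are made-up data, the function name `flag_anomalies` is hypothetical, and the modified z-score with a 3.5 cutoff is just one common rule of thumb for spotting outliers.

```python
from statistics import median


def flag_anomalies(counts, threshold=3.5):
    """Return the indices of values that sit far from the typical level.

    Uses the modified z-score, which is based on the median and the
    median absolute deviation (MAD), so one extreme value does not
    distort the baseline it is being compared against.
    """
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [i for i, c in enumerate(counts) if c != med]
    return [
        i
        for i, c in enumerate(counts)
        if 0.6745 * abs(c - med) / mad > threshold
    ]


# Hypothetical data: failed login attempts per hour on one server.
failed_logins = [4, 6, 5, 7, 5, 6, 4, 5, 120, 6, 5, 7]

suspicious_hours = flag_anomalies(failed_logins)
print(suspicious_hours)  # -> [8], the hour with 120 failed logins
```

A sketch like this also shows why volume and variety matter: with only a handful of hours the baseline is unreliable, and failed logins alone say nothing about what was happening in email or on the network at the same time.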