Data, Data, Data!

According to an IBM study, we now generate an astounding 2.5 quintillion bytes of data per day. Information streams in from digital transmissions and transactions, social media usage, photo/audio/video uploads, downloads and shares, cell phone signals, satellite intelligence, weather radar, GPS signals, and much more. It’s stored on hard drives, flash drives, servers, and clouds. It’s transmitted via satellite, cellular, broadband, Wi-Fi, fiber optics, and more. Collectively, this is big data: data sets so large and varied that analyzing them in full calls for specialized tools and techniques.

Until recently, software engineers and data analysts were able to keep up with the tasks of storing, sifting, and mining all this data. But the current deluge has grown too large and too complex for them to manage alone. That growth, coupled with a drop in the cost of the hardware and software that companies use to capture and analyze big data, has ushered in a new wave of information experts we now know as data scientists.

Data Science vs. Data Analytics

Data scientists have become a logical extension of data analysts. The latter are engineers who typically use SQL to gather information from a database, then use Excel or SAS to manipulate or model the data and create data visualizations. This is a common function at large companies as they look to extract useful information from consumer data.

A data scientist typically has a more robust education and/or training in computer science and a deeper knowledge of programming languages, databases, machine learning, and sophisticated data mining and data modeling techniques. Data scientists also typically possess business training, the communication skills to present their findings, and the design skills needed to plan the entire project or experiment. Data analysts may work within a team that is often led by a data scientist.

In the modern business analytics workflow, data is provided by a data analytics team. The data scientist then determines how that data will be structured and stored, which algorithms will be used to retrieve and analyze that data, and which research questions the overall project is trying to answer. These overarching questions are based on real-world business needs. Performing this techno wizardry requires a vast skill set that may include:

  • Familiarity with database systems and SQL interfaces (ad hoc queries, MySQL, etc.)
  • Exposure to Hadoop platform-based analytics solutions (HBase, Hive, MapReduce jobs, Impala, Cascading, etc.)
  • Programming expertise with Java and Python, and in developing simple MapReduce jobs
  • Exposure to various SQL analytic functions, such as MEDIAN, RANK, and the OVER clause
  • Expertise in mathematics, statistics, and correlation
  • Expertise in data mining and predictive analytics
  • Proficiency in a variety of programs, including R or RStudio, Excel, SAS, IBM SPSS, and MATLAB
  • Expertise in data model development
  • Experience with very large data sets and visualization
  • Familiarity with machine learning and data mining tools and algorithms (Mahout, Bayesian methods, and clustering)
  • Familiarity with data warehousing and business intelligence concepts
  • Exposure to enterprise commercial data analytical stores (Vertica, Greenplum, Aster Data, Teradata, Netezza, etc.)
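
To make a few of these skills concrete, here is a minimal Python sketch of the map, shuffle, and reduce steps behind a simple MapReduce-style job, along with a per-group median and rank similar to what SQL analytic functions compute. It runs in memory on a single machine and is not Hadoop code; the function names and the sample sales data are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from statistics import median


def map_phase(records):
    """Map step: emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle(pairs):
    """Shuffle step: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each group to one summary value (its median)."""
    return {key: median(values) for key, values in groups.items()}


def rank_within_groups(records):
    """Roughly mimic RANK() OVER (PARTITION BY region ORDER BY amount DESC)."""
    groups = shuffle(map_phase(records))
    ranked = []
    for region, amounts in groups.items():
        ordered = sorted(amounts, reverse=True)
        for amount in amounts:
            # Rank is the 1-based position of the first equal value,
            # so tied amounts share a rank and the next rank is skipped.
            ranked.append((region, amount, ordered.index(amount) + 1))
    return ranked


# Hypothetical sample data: (region, sale amount)
SALES = [
    ("east", 120), ("west", 80), ("east", 200),
    ("west", 80), ("east", 150), ("west", 40),
]

medians = reduce_phase(shuffle(map_phase(SALES)))
ranks = rank_within_groups(SALES)

if __name__ == "__main__":
    print(medians)
    for row in ranks:
        print(row)
```

On a real cluster, the map and reduce steps would run in parallel across many machines and the framework would handle the shuffle; the sketch only shows the shape of the computation.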

Acquiring these high-level technology skills typically requires a master’s degree or doctorate in data science from a top-tier university, such as the Master of Information and Data Science offered by our partner, UC Berkeley.