In the previous blog note, we got a glimpse of how data science came about and how it is related to big data. We also looked into the major milestones of this field and why it has become popular in recent years. However, this was just scratching the surface, since data science has much to offer on many more levels. To get a better understanding, we will look into its history, the new paradigms it entails, and the new mindset it brings about along with the changes that follow.
History of the Data Science Field
The term “data science” was around before big data came into play (just like the term “data” preceded computers by four centuries or so). In 1962, when John W. Tukey wrote The Future of Data Analysis, he foresaw the rise of a new type of data analysis that was more of a science than a methodology. In 1974, Peter Naur published a book entitled Concise Survey of Computer Methods in both Sweden and the United States. Although this was merely an overview of the data processing methods of the time, this book contained the first definition of data science as “the science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.” So back then, anyone proficient with computers who also understood the semantics of the data to some extent was a data scientist. No fancy tools, no novel paradigms, no new science behind it. It’s no surprise that the term took a while to catch on.
As computer technology and statistics started to converge later that decade, Tukey’s vision began to materialize, albeit quite subtly. It wasn’t until the late 1980s, though, that it started to gain ground through one of data science’s best-known methods: data mining. As the years advanced, the scientific processing of data rose to new heights, and data science came into the spotlight of academic research through a conference in 1996 called “Data Science, Classification, and Related Methods.” This conference, which was organized by the International Federation of Classification Societies (IFCS), took place in Kobe, Japan. It made data science better known among researchers and distinguished it from narrower data analysis terms, such as classification. This gradually helped make data science an independent field.
In the next year (1997), the Data Mining and Knowledge Discovery journal was launched, defining data mining as “extracting information from large databases.” This was the first data science method to gain popularity and respect in the scientific community as well as in the industry.
The role of data science started to become more apparent at the end of the 1990s as databases grew larger. This was voiced very eloquently by Jacob Zahavi in December 1999 in his article “Mining Data for Nuggets of Knowledge”: “Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Scalability is a huge issue in data mining. Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address web-site decisions.” This made it clear that a new framework for data analysis was needed, and data science emerged as a field to address that need.
In the 2000s, publications about data science started to appear at an increasing rate, though they were mainly academic. Journals and books on data science became more common and attracted interest among researchers. In September 2005, the term “data scientist” was first defined (albeit somewhat generically) in a government report. Later on, in 2007 the Research Center for Dataology and Data Science was established in Shanghai, China.
2009 was a great year for data science. Yangyong Zhu and Yun Xiong, two of the researchers in the aforementioned research center, declared in their publication “Introduction to Dataology and Data Science,” that data science was a new science, distinctly different from natural science and social science. In addition, in January of that year, Hal Varian (Google’s Chief Economist) stated for the press that the next sexy job in the coming decade would be statisticians (a term sometimes used for data scientists when addressing people who are not entirely familiar with the topic). Finally, in June of that year, Nathan Yau’s article “Rise of the Data Scientist” was published on FlowingData, making the role of the data scientist much more familiar to the non-academic world.
In the current decade, data science publications have become abundant, although there is still no decent source of information about how to effectively become a data scientist. The term “data science” gained a more concrete definition, the essence of which was summarized in September 2010 by Drew Conway’s Venn diagram (Fig. 1).
Fig. 1. Conway’s Venn diagram about Data Science.
This diagram illustrates the key components of data science as well as how it differs from the field of machine learning and traditional research. The “danger zone” is the overlap of hacking skills and substantive expertise without math and statistics knowledge: people who can produce a plausible-looking analysis without understanding how they arrived at it. Image source: Drew Conway.
His quote provides further understanding of the fundamentals for becoming a data scientist: “…one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram… hacking skills, math and stats knowledge, and substantive expertise.”
Finally, in September of 2012, Hal Varian’s quote about this decade’s sexy job grew into a whole article in Harvard Business Review (“Data Scientist: The Sexiest Job of the 21st Century”) making an even larger population aware of the importance of the role of the data scientist in the years to come.
It is noteworthy that, parallel to these publications and conferences, there has been a lot of online social activity around data science. The first official data science group was created on LinkedIn in June 2009 (known as Data Scientists group), and currently also has an independent site (datascientists.net as well as datascientists.com, its original name). Other data science groups have been available online since 2008, and since 2010 their number has risen at an increasing rate along with online postings for data scientist jobs. It should also be noted that over the past few years, there have been a lot of non-academic conferences on data science. These conferences are usually rich in workshops and are targeted at data professionals, project managers and executives.
The New Paradigms of Data Science
Data science has brought about or popularized some new paradigms that constitute great tools for any data professional. The main ones are:
- MapReduce – A parallel, distributed algorithm for splitting a complex task into a series of simpler tasks and solving them in a very efficient manner, thus increasing the speed of performing the complex task and lowering the cost of computing resources. Although this algorithm existed before, its wide use in data science has made it better known.
- Hadoop Distributed File System (HDFS) – An open-source platform designed to make use of parallel computing technology, it basically makes dealing with big data manageable by breaking it into smaller chunks that are split over a network of computers.
- Advanced Text Analytics – Often referred to as Natural Language Processing (NLP), this is the field of data analysis that involves techniques for processing unstructured textual data to extract useful information and business intelligence from it. Before data science, this field was far less developed.
- Large scale data programming languages (e.g., Pig, R, ECL, etc.) – Programming languages that work with large datasets (especially big data) in an efficient manner. These were underdeveloped or completely absent before data science appeared.
- Alternative database structures (e.g., HBase, Cassandra, MongoDB, etc.) – Databases for archiving, querying and editing big data using parallel computing technologies.
You may be familiar with the New Technology File System (NTFS) employed by every modern Windows OS. This is a fairly satisfactory file system that works without too many problems for most PCs. However, it is not suited to handling large amounts of data across a network of connected computers; NTFS has a limit of 256 TB, which is insufficient for many big data applications. Unix-based file systems face similar restrictions, which is why when Hadoop was developed, a new type of file system had to be created, one that was optimal for a computer cluster. HDFS allows the user to view all the files on the cluster and perform some basic operations on them as if they were on a single computer (even if most of these files are scattered across the entire network).
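The chunking idea behind a cluster file system can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the HDFS API: the function names (split_into_blocks, place_blocks, read_file), the node names, the tiny block size, and the round-robin placement rule are all hypothetical choices made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes, replication: int = 2):
    """Assign every block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption of this sketch; a real cluster uses
    its own placement policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        idx: [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        for idx in range(num_blocks)
    }


def read_file(blocks, placement, failed_nodes=frozenset()):
    """Reassemble the file, provided each block still has a live replica."""
    parts = []
    for idx, block in enumerate(blocks):
        live = [node for node in placement[idx] if node not in failed_nodes]
        if not live:
            raise OSError(f"block {idx} has no surviving replica")
        parts.append(block)
    return b"".join(parts)


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    placement = place_blocks(len(blocks), ["n1", "n2", "n3"], replication=2)
    print(placement)
    print(read_file(blocks, placement, failed_nodes={"n2"}))
```

With two copies of each block on different nodes, losing any single node still leaves the whole file readable, which is why the user can treat the scattered blocks as one file.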
At the heart of Hadoop lies MapReduce, which is the paradigm that enables the network to crunch the data efficiently with limited risk of failure. All the data is replicated in case one of the computers of the cluster (usually called nodes) fails. There are a number of supervising nodes that are in charge of scheduling the tasks and managing the data flow. First, all of the data is mapped through a set of cluster nodes referred to as mappers. Once it is processed by the mappers, a set of nodes undertakes the task of reducing the resulting processed data into more useful outputs. This set of nodes, referred to as reducers, may include mappers that have finished their job as well. Everything is coordinated by the supervising node(s), ensuring that the outputs of every stage are stored securely (in multiple copies) across the cluster. Once the whole process terminates, the outputs are provided to the user. The MapReduce paradigm involves a lot of programming that can be quite tedious. Its big advantage is that it ensures the process finishes relatively quickly, making efficient use of all available resources, while at the same time minimizing the risk of data loss through hardware failure (something quite common for the largest clusters).
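The map and reduce stages described above can be illustrated with the classic word-count task. The sketch below runs in a single Python process, so it shows only the data flow and none of the distribution, scheduling, or replication a real cluster provides; the function names (mapper, shuffle, reducer, map_reduce) are hypothetical and are not part of Hadoop.

```python
from collections import defaultdict


def mapper(line: str):
    """Map stage: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the mapped values by key so each key reaches a single reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce stage: collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(map_reduce(["big data", "Big clusters"]))
```

Because each call to mapper depends only on its own line, and each call to reducer only on its own key, both stages could be spread across many nodes without changing the result.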
Text analytics have been around for a while, but data science introduced some advanced techniques that make the previous techniques seem almost primitive. Modern (advanced) text analytics allow the user to process large amounts of text data, pinpointing patterns in them very quickly while allowing for common problems such as misspelled words, multi-word terms split over a sentence, etc. Advanced text analytics may be able to pinpoint sentiment (!) in social media posts, recognizing if someone’s comments are literal or sarcastic, something that is extremely difficult for a machine to accomplish without the use of these advanced methods. This advancement was made possible via the application of artificial intelligence algorithms in a Hadoop environment.
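To make the misspelled-words problem concrete, here is a minimal sketch that maps tokens onto a known vocabulary using string similarity from Python's standard difflib module. It is far simpler than the advanced methods discussed above and is not how any particular text analytics product works; the function name normalize, the sample vocabulary, and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches


def normalize(tokens, vocabulary, cutoff=0.8):
    """Replace each token with its closest vocabulary term, if one is close enough.

    Tokens with no sufficiently similar vocabulary term are kept unchanged.
    """
    result = []
    for token in tokens:
        matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        result.append(matches[0] if matches else token)
    return result


if __name__ == "__main__":
    vocabulary = ["hadoop", "cluster", "database"]
    print(normalize(["hadop", "clustr", "pig"], vocabulary))
```

The cutoff is the main design choice: a lower value tolerates more typos but also risks merging words that are genuinely different.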
Large scale data programming languages, such as Pig, R, and ECL, were developed to tackle big data and integrate well with the Hadoop environment (actually, Pig is part of the Hadoop ecosystem). R, which was developed before the advent of big data, underwent a major upgrade that allows it to connect with Hadoop and handle files in HDFS. As programming languages are not too difficult to develop nowadays, other new languages in this category have been developed, so it is good to keep your eyes open. By the end of this decade, it is possible that the current languages will no longer be the first choice for a data scientist (although it is quite likely that R will be around for a while due to its immense user community).
New alternative database structures came about thanks to data science. These structures include Hash Table (e.g., JBoss data grid, Riak), B-Tree (e.g., MongoDB, CouchDB), and Log Structured Merge Tree (e.g., HBase, Cassandra). Unlike traditional databases, these structures are designed for big data, so they are very flexible in how they read and write data records. Each has its own advantages and disadvantages, but they are all better suited to big data than traditional SQL databases, which struggle when the number of records or the number of fields increases beyond a certain level. For example, if you have a very large database (big data warehouse) consisting of a million fields and a billion records, finding a simple maximum value of a given field using a traditional database will take longer than anyone is willing to wait. The same query in a column-oriented database (e.g., HBase) can take a small fraction of that time, since only the one field involved has to be read.
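The maximum-value example can be sketched with a toy cost model in Python. This is not the HBase API or a real storage engine: the function names and the cells_read counter are hypothetical, and the counter is only an assumed stand-in for the amount of data a query has to read.

```python
def max_from_row_store(rows, field_index):
    """Find the maximum of one field when whole records are stored together.

    In this model each record is fetched as a unit, so every cell of
    every record counts as read.
    """
    best = None
    cells_read = 0
    for record in rows:
        cells_read += len(record)
        value = record[field_index]
        if best is None or value > best:
            best = value
    return best, cells_read


def to_column_store(rows):
    """Regroup the same data so that each field is stored contiguously."""
    return [list(column) for column in zip(*rows)]


def max_from_column_store(columns, field_index):
    """Find the maximum of one field by reading only that field's column."""
    column = columns[field_index]
    return max(column), len(column)


if __name__ == "__main__":
    rows = [(1, 50, 7), (2, 80, 3), (3, 65, 9)]
    print(max_from_row_store(rows, 1))
    print(max_from_column_store(to_column_store(rows), 1))
```

Both layouts return the same maximum, but in this model the row layout reads every field of every record while the column layout reads one value per record, a gap that grows with the number of fields.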
All of these paradigms are based on the notion that a team of computers, in the form of a cluster, works significantly better than any single (super)computer, given that there are enough members in that team. The innovation lies in the intelligent and customized approaches to planning the essential tasks so that they are efficiently handled by the computer cluster; in essence, optimizing the process of dealing with the problem at hand. It is no coincidence that these paradigms have exhibited increased popularity since their creation and that they continue to evolve rapidly. There is a lot of interest (and money) invested in these technologies; learning them now is bound to pay off in the near future.
The New Mindset and the Changes It Brings
By now, you’ve probably figured out that data science is not merely a set of clever tools, methodologies, and know-how. It is a whole new way of thinking about data altogether. Naturally, this paradigm shift brings about certain changes in the way people work on related projects, how they engage with the problems at hand, and how they develop themselves as professionals.
Data science requires us to think more systematically, combining an imaginative approach to problems with solid pragmatism. This translates into a way of thinking that resembles that of a good civil engineer, combining an artistic perspective (through design) with hard-core engineering and time management. Planning is a crucial aspect of working with big data as different ways of doing the same task may have vastly different demands on resources without any significant difference in the results.
The changes this new mindset brings are evident in the way a data scientist functions. The data scientist usually works as part of a varied team consisting of data modelers, businesspeople, and other professionals (depending on the industry). It is very rare to see a data scientist work on his own for long periods of time as a traditional waterfall model programmer would, for instance.
In addition, the data scientist handles problems by taking advantage of current literature, connecting with a variety of professionals who may be more knowledgeable on the problem he is facing, and breaking problems down into manageable sub-problems that he gradually solves.
The skills a data scientist needs to be successful are not uncommon individually. A data scientist should be able to learn new things easily. With the fast pace of development of big data technologies, a data scientist must have an agile mind that is quick to grasp new methods and familiarize itself with new tools.
A data scientist must also be proactive, anticipating things that will be needed in his work, problems that may arise, and anything else that will require his time. Existing methods may need to be fine-tuned or customized for the problem at hand.
A data scientist needs to be flexible, adapting easily to a new business domain, new team members, and new tools (the software he uses when starting a job may be quite different from what he ends up using later in that job). He needs to be adept at networking and should understand the value of the skills he is missing so he takes steps to develop them. Overall, almost all of the skills that a data scientist has are highly transferable and applicable to a large variety of situations. As a result, he is a potent professional who can be an asset to any team, especially an IT one.
- Data science is older than most people think, but it only started gaining ground in the past decade (2000s).
- Drew Conway’s well-known Venn diagram, created in September 2010, effectively summarizes the essence of data science.
- Data science has brought about some new paradigms that change the way we deal with data, the main ones being:
  - MapReduce
  - Hadoop Distributed File System (HDFS)
  - Advanced Text Analytics
  - Large scale data programming languages (e.g., Pig, R, ECL, etc.)
  - Alternative database structures (e.g., HBase, Cassandra, MongoDB, etc.)
- Data science’s paradigm shift in the way we deal with data has changed our working lives as data professionals, bringing about a whole new mindset that is essential for dealing with big data.
- The new mindset that data science promotes brings about several changes in the data scientist’s professional life and in the way he interacts with others.