The smallest form of data that a computer can work with is the binary digit. This binary digit can assume only one of two possible values: 0 or 1. The term binary digit is truncated to bit. The bit is thus used to describe the smallest data item in the data hierarchy.
The binary number system is based on the bit (binary digit).
The next level in the data hierarchy is the character. A character can be a numeral digit, alphabetical letter, or any of the special symbols that are used in mathematics and grammar.
Each character has its own unique bit pattern. For example, in ASCII the character 5 is encoded as the value 53, whose bit pattern is 00110101 (note that this is distinct from the number 5, whose binary value is 101).
The two main character sets used by programs are Unicode (Standard) and the American Standard Code for Information Interchange (ASCII). ASCII is considered a subset of Unicode due to its limited character set, with all its characters included in the Unicode Standard.
When characters are grouped together to convey a meaningful idea, it creates a field. Examples of a field include a name, date, and program syntax.
Related fields are grouped together to create a record. For example, the name of a person, his date of birth, current location, and government-issued identification number compose a record known as the Personally Identifiable Information (PII) record.
Related records are grouped together to create a file.
Data needs to be stored in a database so that it can be easily retrieved for manipulation and subsequent storage.
There are different types of databases, e.g., non-relational databases and relational databases; a relational database organizes data in tables.
Machine, Assembly, and High-Level Languages
Machine language is hardware-specific because it is the only programming language that the computing device can understand and execute directly. It is composed of numeric instructions that are ultimately represented as bits in the central processing unit (CPU) so that they can be worked on by the arithmetic logic unit (ALU).
Programming in machine language is the most tedious form of programming. In fact, assembly language was invented to deal with this issue. In assembly language, English-like abbreviations (called mnemonics) were introduced to represent operations, hence accelerating the rate of program development.
Assembly language needs to be translated into machine language for it to be executed by the computer. The program that does this translation is called the assembler. Assemblers usually operate at computer speed.
The main disadvantage of assembly language (to programmers) was that it required many instructions to execute a simple task. This led to the development of high-level programming languages, which are closer to English syntax and use fewer instructions per task than assembly language.
In a high-level language, a single statement can accomplish a substantial task.
To translate the high-level language into its corresponding machine language, a special program called a compiler is used. This process of translation is called compilation.
Alternatively, a high-level program can be run by an interpreter, which must be installed on the computer. Normally, an interpreted program runs more slowly than its compiled equivalent.
History of C Programming Language
In 1967, Martin Richards developed a programming language called Basic Combined Programming Language (BCPL) for use in writing compilers and operating systems for computers.
In 1969, Dennis Ritchie and Ken Thompson developed the B programming language, which was modeled after BCPL. While at Bell Laboratories in 1970, Thompson used this language to develop an operating system (OS) called UNIX. Even so, the utility of this UNIX OS was limited by its dependence on specific hardware.
There was a need to develop an operating system that was hardware-independent, and this required a more evolved programming language to create this OS. In 1972, Dennis Ritchie developed the C programming language from the predecessor B programming language.
In 1978, Dennis Ritchie and Brian Kernighan published the first textbook titled The C Programming Language which taught learners how to program using this high-level programming language. This popularized the programming language thus increasing its uptake in the field of computer science.
By 1983, there existed different, incompatible versions of the C language, with most of this incompatibility traced to the hardware specificity of C implementations. This meant that a C program created for one hardware platform could not be used on another hardware platform. This led to the quest for a standardized C programming language, which was created and adopted in 1989 as Standard C.
Standard C was approved by the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
The latest Standard C was approved in 2011 as C11, and was later updated in 2018 as C18. Its current documentation is the ISO/IEC 9899:2018.
Other programming languages that developed alongside C are:
Beginners’ All-Purpose Symbolic Instruction Code (BASIC) was developed in 1964 at Dartmouth College by John Kemeny and Thomas Kurtz to teach programming techniques.
C++ was developed in 1985 by Bjarne Stroustrup at Bell Laboratories. It extends the capabilities of C through object-oriented programming (OOP).
Objective-C was developed by Brad Cox and Tom Love in 1984. It extended C through OOP, and was used by NeXT, Inc. to develop its NeXTSTEP OS, which utilized the Mach kernel. NeXTSTEP evolved into macOS, which used Objective-C as its main programming language. This language was also used in Apple's mobile OS, iOS, until 2014, when it was replaced with Swift. Presently, Swift includes features from Python, Java, C#, and Ruby.
Python debuted in 1991 as an OOP language with easy code readability due to its closer syntax to English grammar as compared to C/C++. It was developed by Guido van Rossum.
Java was developed in 1991 by James Gosling of Sun Microsystems as a C++-based OOP language that was hardware-agnostic (i.e., it can run on almost any hardware platform, hence its mantra "write once, run anywhere" [WORA]). It was designed to address the issue of hardware specificity that affected C++ programs.
C# was developed by Microsoft as a programming language based on Java and C++. It is used to program applications that need to communicate with the web.
R programming language was developed in 1993 by Robert Gentleman and George Ross Ihaka as a dynamic object-oriented language for building and running statistical applications that support data visualization.
The C Standard Library is a collection of pre-programmed, ready-to-use functions.
This library eases the process of software reuse.
Program Development Environment
The standard C system is made up of 3 components:
The C language.
The C standard library.
The program development environment.
The process of building a C program has 6 phases:
Editing: The program source code is created in an editing software that is generically described as the editor. The editor can be a standalone editor like vi or emacs, or an integrated development environment (IDE) like Xcode or Visual Studio. The source code is made up of characters only. The edited file is saved with the .c extension.
Preprocessing: It is executed by the C preprocessor which is part of the compiler program. The preprocessor executes a set of commands designated as preprocessor directives which modify the source code of the program. These directives allow for the replacement of some texts and the inclusion of new content from external files.
Compilation: The compiler translates the preprocessed source code into a machine language code known as the object code. If the compiler encounters an unrecognized statement in the source code, or any statement that violates the rules of the language, then it stops the compilation and issues an error message described as a compile-time error, syntax error, or compile error. This establishes C as a compiled programming language.
Linkage: The source code uses functions that are stored in the C standard library, an open-source library, or an external private C library. These functions create gaps in the object code that need to be filled by linking this code with the library that contains the referenced functions. The program that links the object code with the appropriate library is called a linker, and it outputs an executable image. By default, this executable image is named a.out. This executable file can be run on a hardware platform.
Loading: This is the process where the executable image is put into the primary memory (i.e., RAM) in readiness for execution. This process is done by a loader that takes the executable image and related files (from shared libraries) from the secondary storage unit (e.g., a hard disk) and transfers them into the memory unit. The files from the shared libraries that are needed for the executable image to run are called dependencies.
Execution: The program files and shared libraries' components in the memory unit are executed by the CPU in a sequential manner, i.e., the first instruction needs to be executed before the next instruction is executed. Errors that occur during this phase are called execution-time errors or runtime errors, and some of them, e.g., division by zero, result in fatal errors that lead to program termination.
The C program processes data; this requires data input to feed data into the program (for processing), and the processed data is then presented as the data output. Some C functions use the standard input stream, represented as stdin, as the source of data input. Normally, the keyboard is the stdin. Likewise, some functions deliver their data output to the standard output stream, represented as stdout, which is usually the monitor. If an error occurs, then it can be output to the standard error stream, represented as stderr, which is also normally the monitor, where the error message is displayed.
Internet and the World Wide Web (WWW)
In 1957, the Soviet Union launched an artificial (earth) satellite into the elliptical low earth orbit. This technological feat surprised the United States Government (USG) which at that time lacked a homegrown artificial satellite that could be launched into space. To remedy this situation, President Dwight David Eisenhower created a specialized research agency on February 7, 1958, called the Advanced Research Projects Agency (ARPA).
ARPA funded universities and research institutions across the United States, and even provided them with mainframe computer systems. In 1966, ARPA launched project ARPANET to connect these different mainframe computer systems together so as to allow for easy and streamlined communication among them, as well as enable remote access to any of the connected systems.
The first batch of mainframe computers was successfully connected together in 1969 to form a network called ARPANET. In 1970, the protocol stack of ARPANET was improved through the implementation of the Network Control Protocol (NCP), making the network fully operational in 1971.
In packet switching, the data to be transferred is first broken up into small pieces called packets, which are numbered sequentially. These packets are moved across the network independently and are then regrouped at the destination computer into the original data.
Initially, ARPANET achieved a line speed of 2400 bits-per-second (bit/s) which was faster than the existing telephone line speed of 110 bits/s. By 1970, ARPANET had upgraded its line speed (or data transfer rate) to 50000 bits/s or 50 kilobits-per-second (Kb/s). These initial network speeds were very high for their age, and this made further research in communication networks a viable field. This research paid off as attested to by the recent feat achieved in 2020 in Australia where internet connection speed reached 44.2 Terabits-per-second (Tb/s) or 44,200,000,000 Kb/s.
ARPANET allowed researchers to communicate easily with each other using electronic mail (e-mail). It was this capability to scale communication, easily deploy its communication infrastructure, and make this infrastructure available to many people that drove the evolution of ARPANET into the Internet.
ARPANET allowed for networking hardware and software to evolve and become available to organizations. This allowed organizations to develop their own computer networks. In the organization, this network that connected its computers together, usually in the same building or facility, was called the intranet.
The intranet allowed for intra-organization communication. Still, organizations needed to communicate with each other and share information and emails, and thus there was a need for inter-organization communication.
ARPA developed the Internet Protocol (IP) to allow for inter-organization communication, thus establishing a network of intranets that developed to become the modern internet.
The Internet allowed businesses to scale their operations and easily reach out to their existing customers, as well as expand their customer base. In turn, this increased the load on the existing internet infrastructure, which automatically created a demand for more infrastructural resources. This demand could best be met by increasing the information-carrying (or data carriage) capacity of existing communication lines. The information-carrying capacity of a communication line is called its bandwidth. Therefore, increased demand for internet infrastructure led to increased bandwidth in the evolving infrastructure. Increased bandwidth allowed users to easily access files stored on remote computers.
This collection of networking hardware, software, communication protocols, and computers that allow information to be shared across the internet is called the World Wide Web (WWW).
Among the files shared on the World Wide Web were documents, which usually contained text only.
In 1980, Tim Berners-Lee, while working at the European Organization for Nuclear Research (CERN), developed ENQUIRE, a hypertext software system aimed at addressing the shortfalls of two existing hypertext systems, Memex and NLS (oN-Line System). ENQUIRE was written in the Pascal programming language.
Hypertext is any text that references another text that can be accessed. Each reference in this hypertext is called a hyperlink because it is a link that connects the hypertext to its referenced resource.
In ENQUIRE, each document was called a card, and each card contained hyperlinks that allowed one to access other cards. This allowed for cards to be linked (together) with each other – by bidirectional hyperlinks – in a database. The main shortfall of ENQUIRE was that it was not user-friendly and therefore not many co-workers in CERN could create new cards and link them to the existing card database. Moreover, one could not link to external databases.
In 1989, Berners-Lee started developing a novel hypertext information system that could be used to share information through hyperlinked documents. He also developed a communication protocol for this system, which he called the HyperText Transfer Protocol (HTTP). In 1993, this hypertext information system was released with a standardized markup language called the HyperText Markup Language (HTML).
A markup language is a text-encoding system that requires unique sets of symbols to be inserted into a text document so as to control how it is displayed, structured, and formatted, as well as describe relationships between different parts of this document.
In HTML documents, the markup language allows for automated processing and presentation of the document as a webpage (by a web browser).
In 1994, Tim Berners-Lee established the World Wide Web Consortium (W3C) to develop web technologies and make them universally available.
Introduction to Big Data
In 2016, IBM reported that 2.5 Exabytes (or 2,500,000 Terabytes [TB]) of data is created daily, mostly by internet users. IDC projects that this data volume will increase to 175 Zettabytes (175,000 Exabytes) per annum by 2025.
As a rule of thumb, the more the data, the more computing power is needed to process it. Likewise, the more computing power needed, the more energy is needed to satisfy its power demands. For instance, the energy required to process Bitcoin transactions in 2022 was estimated at 204.5 TWh (terawatt-hours), which exceeds the annual energy consumption of several nations.
The performance of a central processing unit (CPU) is measured using one of two measures: instructions per second (IPS) and floating-point operations per second (FLOPS). FLOPS provides a better measure than IPS.
The Fugaku supercomputer, made by Fujitsu, reached 442 × 10^15 FLOPS, i.e., 442 petaFLOPS, in 2020. The distributed computing network for drug design and disease research called Folding@home had already reached 100 petaFLOPS in 2016.
Quantum computers are projected to be able to execute more instructions in one second than all the instructions executed by conventional computers since the computer was invented.
In 1962, the term data analysis was coined by John Tukey in his paper, The Future of Data Analysis. It was principally tied to statistics.
In 1987, the term big data was coined and was then popularized by John Mashey.
Big Data is associated with the 5Vs, which are Volume, Velocity, Variety, Veracity, and Value.
The ability to derive insights and value from Big Data forms the basis of Data Analytics.