The Role of Data in Computer Programming

In this lesson, we identify and explore some fundamental concepts that underlie computer programming. Understanding these will enable you to give clear, high-level explanations of the way that programs work, and will help you to design and implement software effectively.

We will focus primarily on the concepts of data and data structure, and outline, only in general terms, the role of data in algorithms, architecture and applications. Specific algorithms, architectures and applications will be introduced and explored in detail later in the module.

Learning Goals

After completing this lesson you should be able to:

  • describe software in terms of the key concepts of: data, data structure, algorithm, architecture and application,

  • give examples of the wide variety of data that is involved in computer applications,

  • explain how data can be used to support various kinds of software functionality.

Key Concepts

Let us start with definitions of some key concepts:

  • Data: information that is stored in some specific format
  • Data Structure: a particular way of formatting or organising data
  • Algorithm: a specification of a computational process
  • Architecture: the overall structure of a computer program in terms of its component algorithms and data
  • Application: a piece of software that provides one or more useful functions

These definitions are intentionally very general. They refer to aspects of computer programs and programming that are often highly interconnected. However, considering them as separate aspects can make it much easier to think about and design programs that can perform very complex tasks.

What is Data?

In science, the word data refers to information in the form of a set of measurements or records that have been collected for reference or analysis, either directly from measurement of the world or from other data sources.

A computer program can only access and operate on information that is represented in some kind of structured format, and it cannot determine the meaning or the origin of the information it receives, except in so far as that meaning and origin are encoded within the format. Hence, in computing, the word data can refer to any kind of information that is stored in some specific format for which there is a convention for interpreting what is stored.

To be brief: data is any kind of information stored in a specific format.
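
To make the role of the interpreting convention concrete, here is a minimal sketch in Python. The four byte values are chosen purely for illustration. The same stored bits yield quite different information depending on which convention is used to read them.

```python
import struct

# Four bytes, written here in hexadecimal. On their own they are just bits.
raw = bytes([0x41, 0x42, 0x43, 0x44])

# Convention 1: each byte is an ASCII character code.
as_text = raw.decode("ascii")

# Convention 2: the four bytes together form one big-endian unsigned integer.
(as_integer,) = struct.unpack(">I", raw)

# Convention 3: each byte is a separate small number in the range 0-255.
as_numbers = list(raw)

print(as_text)      # ABCD
print(as_integer)   # 1094861636
print(as_numbers)   # [65, 66, 67, 68]
```

Nothing in the bytes themselves says which reading is the intended one; that knowledge has to come from the format and from the program that uses it.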

Nearly all computer programs involve some kind of data processing. But this data can vary greatly in its type, quantity and complexity, and there are a huge number of ways that a computer program can operate with data.

Data Examples

The following examples illustrate the huge variety of possible kinds of data that we might want to deal with:

  • three numbers describing the size of a box in terms of height, width and depth,
  • a sequence of temperature measurements,
  • financial information (e.g. bank accounts),
  • a database of information about employees of a company,
  • an inventory of products and stock held by a supermarket,
  • a 3D representation of a human body,
  • the text of a book,
  • audio data (e.g. in mp3 or flac format),
  • image data (e.g. photos in jpeg format),
  • video data (e.g. in mpeg format),
  • a large repository of URLs and textual data harvested from the web.

There are many other kinds of data.

Can you think of some?

More answers

  • Map data
  • Birth and death records
  • Weather and climate data
  • Exam results
  • Stock market prices
  • TV Schedules and viewing figures
  • Sports results

What kinds of data are you interested in?

Representing Information: Data and Data Structures

Data is always encoded within some kind of representational system. This system enables us to interpret the data and give it meaning. The meaning of an item of data is often some fact about the world (for example the height of a building, or the birth date of a person) but it could also represent non-factual information such as a picture or sound, or an abstract mathematical object.

Whatever is represented by data, it needs to be in a format for which there are specific conventions that define its meaning. At the lowest level of detail, nearly all information handled by computers is in the form of binary digits (0s and 1s), which are stored as the state of some electronic medium. Using these as building blocks, formatting conventions are used to encode much more complex types of information. The following list gives the main types of data format in order of increasing structural complexity:

  • Binary Digits (0, 1)
  • Numbers, Characters
  • Sequences of Numbers, Strings (sequences of characters)
  • Complex data structures: lists, dictionaries, sets, trees
  • Structured data objects:
    • CSV files
    • Standard formats for images, video, audio etc
    • Structured objects defined by classes
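
The first few levels of this list can be sketched directly in Python. The particular values below (the number 65, the temperature readings, the box dimensions) are made up for illustration only.

```python
# Binary digits: the 8-bit pattern that encodes the number 65.
bits = format(65, "08b")

# Numbers and characters: the same pattern read as a number, then as a character.
number = int(bits, 2)
character = chr(number)

# Sequences: a list of numbers and a string (a sequence of characters).
temperatures = [18.5, 19.0, 21.5]
name = "Bulbasaur"

# A more complex data structure: a dictionary that groups named values.
box = {"height": 2.0, "width": 3.0, "depth": 1.5}
volume = box["height"] * box["width"] * box["depth"]

print(bits, number, character)   # 01000001 65 A
print(len(temperatures), name[0])  # 3 B
print(volume)                    # 9.0
```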

The handling of numbers, characters, sequences and complex data structures in Python will be explored in the following lessons. But, to illustrate the diverse ways in which data may be formatted, we now look briefly at two examples: CSV files and images. You will see many other ways as the course develops.

CSV Files

CSV stands for Comma-Separated Values. The CSV file format provides a general and convenient way to store data in a form similar to how it would be recorded in a table or in a spreadsheet program. In fact the relation between spreadsheets and CSV files is so close that nearly all spreadsheet software (e.g. Excel and Gnumeric) allows data to be both imported from and exported to CSV files.

CSV files normally end in the extension .csv. Each line of the file represents a data record, which is a sequence of values, separated by (you guessed it) commas.

Here is an example of the first few lines of a file pokemon.csv:

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1
4,Charmander,Fire,,309,39,52,43,60,50,65,1
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1

Here, as is often (but not always) the case, the first line is a header line. This is a comma-separated sequence of headers, each of which gives a description of the type of data item in the column below.

As in most CSV data files, each record (i.e. each line apart from the header) represents a group of data values that relate to a particular item, and each of these values represents a particular attribute of that item. Also, each item has the same kinds of attribute in the same sequence (corresponding to the column headers). Thus, each record/line has a definite meaning (in this case giving the vital statistics of a species of cute imaginary creature). However, you should note that the CSV format itself does not specify how the contained information should be interpreted. The header line, if present, just associates certain characters with each column, which may or may not be informative to a programmer. Hence, when writing code that operates on data extracted from a CSV file, it is the responsibility of the programmer to ensure that the data is meaningful and is processed in a way that is appropriate to what it means.

We shall later see many examples of information in CSV files, and will also see how this type of data can easily be read into and manipulated within Python (e.g. using lists or using the powerful dataframe class provided by the pandas library).
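
As a small preview, the following sketch reads a few of the records shown above using Python's standard csv module. To keep it self-contained, the records are held in a string rather than read from the file pokemon.csv, and the choice to treat the HP column as a whole number is an assumption made by the programmer, not something the CSV format tells us.

```python
import csv
import io

# A few records from the pokemon.csv example, held in a string so that
# the sketch runs without the file being present.
sample = """#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1
4,Charmander,Fire,,309,39,52,43,60,50,65,1
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1
"""

# DictReader uses the header line to label each value in a record.
records = list(csv.DictReader(io.StringIO(sample)))

# Every value arrives as a string; converting HP to an integer is our decision.
hit_points = {row["Name"]: int(row["HP"]) for row in records}
print(hit_points)   # {'Bulbasaur': 45, 'Charmander': 39, 'Charizard': 78}

# An empty field (Charmander has no second type) is read as an empty string.
print(repr(records[1]["Type 2"]))   # ''
```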

Questions
  • The first column of the example pokemon.csv data file seems to be an ID number. But can you see something odd about it that might cause problems?

  • What if some of the data you want to store in a CSV file contains commas? For example an address might contain commas. Can we use a CSV file to store such information?

Note on file extensions

If you do not understand the term file extension (aka filename extension), you should read up on it. It is a simple concept, but it will not be explained here as you can easily find the details from other sources (such as Wikipedia). File extensions may be hidden by a file browsing tool. In particular, Windows File Explorer does not show file extensions by default. However, it is possible to change its settings to show the extension, and you are recommended to do this, since it is often useful to see file extensions when programming.

Images

In the modern age, flat-screens scintillate with images of all forms. Let us take a glance at the types of data format that are used to store and to display images.

Each image is stored in a complex digital format that can be interpreted by software and rendered as screen pixels by sending corresponding streams of digits to a computer's GPU (Graphics Processing Unit). Many different formats are used for storing images, but some are much more common than others: GIF, JPEG and PNG are currently the most common.

The image displayed on a computer screen is produced by sending a signal to its video display adapter that is generated from a special area of memory known as a Frame Buffer. The frame buffer stores a colour value for each pixel of the display in some binary form. With some simplification of actual pixel colour encoding schemes, we may assume that the colour of each screen pixel is represented by the magnitude of Red, Green and Blue light to be emitted from the pixel (this is the well-known RGB colour encoding), and that each of these magnitudes is represented by a byte (corresponding to a number in the range 0-255).
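
The simplified frame buffer just described can be modelled in a few lines of Python. This is only a sketch of the idea, not of how any real display adapter works: the tiny 4 x 3 display and the helper functions set_pixel and get_pixel are hypothetical.

```python
WIDTH, HEIGHT = 4, 3          # a tiny hypothetical display
BYTES_PER_PIXEL = 3           # one byte each for red, green and blue

# The frame buffer is one flat block of bytes, initially all zero (black).
frame_buffer = bytearray(WIDTH * HEIGHT * BYTES_PER_PIXEL)

def set_pixel(x, y, rgb):
    """Write an (R, G, B) colour into the buffer at pixel position (x, y)."""
    start = (y * WIDTH + x) * BYTES_PER_PIXEL
    frame_buffer[start:start + BYTES_PER_PIXEL] = bytes(rgb)

def get_pixel(x, y):
    """Read back the (R, G, B) colour stored at pixel position (x, y)."""
    start = (y * WIDTH + x) * BYTES_PER_PIXEL
    return tuple(frame_buffer[start:start + BYTES_PER_PIXEL])

set_pixel(1, 2, (255, 255, 0))     # make one pixel yellow
print(get_pixel(1, 2))             # (255, 255, 0)
print(len(frame_buffer))           # 36

# The same arithmetic for a 1920 x 1080 display:
print(1920 * 1080 * BYTES_PER_PIXEL)   # 6220800
```

Under this encoding, a 1920 x 1080 display needs a little over six million bytes just to hold one screenful of colour values.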

Although the ways in which an image is represented are complex, and vary depending on whether we are storing it in a file or rendering it to the screen via a frame buffer, high-level languages such as Python provide simpler ways of accessing and manipulating images. For instance, the Python module PIL (Python Imaging Library, nowadays distributed as the Pillow package) defines a class of image data objects which can be:

  • created from or exported to image files of various formats,
  • displayed on the screen,
  • and modified (e.g. clipped, re-sized, rotated, etc) by means of convenient functions that can be executed within a Python program.

image

A simple image processing example using PIL

The following example code illustrates how a very simple image processing operation can be accomplished in Python by means of the high-level functionality provided by the PIL library. The code reads in the image from the PNG file images/yellow-smiley.png, replaces every sickly yellow pixel in the image by a more healthy pink pixel, and saves the result into the file images/pink-smiley.png, as well as displaying the result using Jupyter's display function (which can be used to display a wide variety of kinds of output value in the output area of a Jupyter code cell).

Note that the colour encoding used in PIL's representation of a pixel in this example consists of a tuple of four values, (R, G, B, A), which are each numbers in the range 0-255 and correspond to red, green, blue, and alpha values. The alpha value determines the opacity of the pixel, with 0 being fully transparent and 255 being fully opaque.

Don't worry if you do not fully understand this program. It should become clear once you know more about the Python language.

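from PIL import Image

# Open the image and make sure every pixel is an (R, G, B, A) tuple.
smiley_image = Image.open("images/yellow-smiley.png").convert("RGBA")

(width, height) = smiley_image.size

yellow = (255, 255,   0, 255)
pink   = (255, 200, 200, 255)

# Visit every pixel and repaint the pure yellow ones pink.
for x in range(width):
    for y in range(height):
        col = smiley_image.getpixel((x, y))
        if col == yellow:
            smiley_image.putpixel((x, y), pink)

smiley_image.save("images/pink-smiley.png")

# display() is provided by Jupyter; it shows the image below the code cell.
display(smiley_image)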

When run, this will give the following output:

[Output image: the smiley with its yellow pixels replaced by pink]

Note on efficient image representation and encapsulation.

Real display hardware and image file formats may use more compact encodings of colour than the one just described. A classic example is a colour palette (indexed colour), in which each pixel stores a small index into a table of colours that have been selected from a much larger set of possible colours. Given that many images involve only a small subset of the possible colours, this allows all the required colours to be represented using a much smaller number of bits per pixel than would be needed to represent every possible colour. Hence, less memory is needed and the image can be stored, transferred or updated more quickly.
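
The palette idea can be sketched as follows. The four colours and the tiny 4 x 3 image are invented for illustration, and the sketch ignores how real hardware or file formats actually lay out palettes in memory.

```python
# A palette holding the only four colours that this image uses.
palette = [
    (0, 0, 0),        # index 0: black
    (255, 255, 0),    # index 1: yellow
    (255, 200, 200),  # index 2: pink
    (255, 255, 255),  # index 3: white
]

# Each pixel stores a palette index rather than a full (R, G, B) triple.
indexed_pixels = [
    [3, 1, 1, 3],
    [1, 0, 0, 1],
    [3, 1, 1, 3],
]

# Expanding the indices recovers the full-colour image.
full_colour = [[palette[i] for i in row] for row in indexed_pixels]

pixel_count = sum(len(row) for row in indexed_pixels)
bits_direct = pixel_count * 24     # 8 bits for each of R, G and B
bits_indexed = pixel_count * 2     # 2 bits are enough to index 4 colours
print(full_colour[1][0])           # (255, 255, 0)
print(bits_direct, bits_indexed)   # 288 24
```

The palette itself must also be stored (here 4 x 24 = 96 bits), so the saving only becomes significant when the image has many more pixels than colours.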

A Python image data object may also make use of some form of compression to reduce the amount of memory required to store it. But even if this is the case, we can still access and manipulate the image object in a convenient way, as if it were an array of pixels, each of which is located at an x, y coordinate and has a colour specified by R, G, B (and A) values. This illustrates the important concept of encapsulation.

The idea of encapsulation is that when implementing complex datatypes, it is often useful to hide the way that the data is represented from the rest of the program. In order to reduce memory requirements or increase processing efficiency, a datatype may be represented in a complex and unintuitive format. But when accessing and manipulating the data it should appear to have a simple intuitive form. Hence the complexity is encapsulated within the implementation of the data type rather than being allowed to affect the whole program.
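
Here is a minimal sketch of encapsulation, assuming a hypothetical class RunLengthRow that stores a row of pixels in run-length encoded form. Run-length encoding is chosen only because it is easy to follow; it is not a claim about how PIL stores images internally.

```python
class RunLengthRow:
    """A row of pixels stored compactly as [colour, run_length] pairs."""

    def __init__(self, pixels):
        self._runs = []                      # hidden internal representation
        for colour in pixels:
            if self._runs and self._runs[-1][0] == colour:
                self._runs[-1][1] += 1       # extend the current run
            else:
                self._runs.append([colour, 1])

    def __len__(self):
        return sum(length for _, length in self._runs)

    def get_pixel(self, x):
        """Return the colour at position x, as if the row were a plain list."""
        if x < 0:
            raise IndexError("pixel position out of range")
        for colour, length in self._runs:
            if x < length:
                return colour
            x -= length
        raise IndexError("pixel position out of range")


row = RunLengthRow(["white"] * 5 + ["yellow"] * 3 + ["white"] * 2)
print(len(row))           # 10
print(row.get_pixel(6))   # yellow
print(row._runs)          # [['white', 5], ['yellow', 3], ['white', 2]]
```

Code that uses len(row) and row.get_pixel(x) does not need to know about the runs at all, so the internal representation could later be changed without affecting the rest of the program.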

The Uses of Data in Computer Programs

External Input Data vs Internal Program Data

In many coding scenarios, data is derived from a source that is external to a computer program and is then read into and manipulated by the program. Hence, from the point of view of programming, data has both an external and an internal form.

Programming languages provide a variety of input functions that allow data to be read into a program, either from streams of bytes or, more commonly, from files. We shall later look in some detail at how Python enables data to be input from files, and how we can access and process various particular types of external data. However, for the time being we shall assume that the data we want to work with has already been input to the program and is stored as internal program data.

When considering the fundamentals of how a program manipulates its internal data, we are not particularly concerned with what the data means, or how much of it there is. So in this part of the module we will look at examples involving small amounts of very simple data, such as just a few numbers or words. Later, we shall see that the same programming techniques can be applied to analysing large sets of real data containing meaningful information about the world.

We should also note that not all internal program data is derived from external data. Data is often created for book-keeping purposes, either to control algorithms or to help with the processing of other data. And, in certain kinds of application, for example simulation programs, functionality may depend on processing large amounts of data that is all generated internally.
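
As a small illustration of book-keeping data, the following sketch removes repeated values from a list while keeping the original order. The function name unique_in_order and the readings are made up for this example. The set called seen is never part of the input or the output; it exists only to help the processing.

```python
def unique_in_order(items):
    """Return the items with repeats removed, keeping first occurrences."""
    seen = set()        # book-keeping data: which values have we met so far?
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

readings = [21, 19, 21, 23, 19, 25]
print(unique_in_order(readings))   # [21, 19, 23, 25]
```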

Data and Algorithms

As defined above, an algorithm is a specification of a computational process. Prior to the development of electronic computers, mathematicians specified algorithms by means of symbolic representations and rules for transforming sequences of symbols to give desired results. For example, we can multiply two numbers by writing them down as sequences of digits and then following a sequence of rules that manipulate these digits to obtain a new sequence of digits that represents the result of the multiplication.
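
The pencil-and-paper multiplication procedure can itself be written as a program that works only on sequences of digits. The following is one possible sketch (the function name multiply_digits is hypothetical); it follows the familiar school method of multiplying digit by digit and carrying.

```python
def multiply_digits(a_digits, b_digits):
    """Multiply two numbers given as lists of decimal digits.

    Digits are given, and returned, with the most significant digit first.
    """
    # Work with the least significant digit first, as on paper.
    result = [0] * (len(a_digits) + len(b_digits))
    for i, a in enumerate(reversed(a_digits)):
        carry = 0
        for j, b in enumerate(reversed(b_digits)):
            total = result[i + j] + a * b + carry
            result[i + j] = total % 10     # digit to write down
            carry = total // 10            # digit to carry
        result[i + len(b_digits)] += carry
    # Remove unused leading zeros (which are at the end of the reversed list).
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return result[::-1]

print(multiply_digits([1, 2, 3], [4, 5]))   # [5, 5, 3, 5]  (123 x 45 = 5535)
```

Python can of course multiply integers directly; the point of the sketch is only that the rules operate on symbols, without any built-in notion of what the digits mean.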

Of course, in the age of electronic computation, the most obvious way to specify an algorithm is by a computer program. And computer languages (such as Python) provide us with convenient symbolic systems to specify algorithms, in a form that can be automatically executed by computer hardware.

An algorithm does not necessarily operate on any input data. For instance, one may define an algorithm to search for a solution to a mathematical problem (e.g. what is the largest number whose cube is less than 10000?). However, most algorithms do operate on some kind of data: in some cases just a few numbers, in other cases a huge database of statistics or a library of images. And nearly all algorithms output some kind of data as a result.
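
The cube example can be written as a short search that takes no input data at all. This is just one straightforward way to do it:

```python
# Search upwards from 0 for the largest whole number whose cube is below 10000.
n = 0
while (n + 1) ** 3 < 10000:
    n += 1

print(n)        # 21
print(n ** 3)   # 9261
```

The program still produces output data (the number 21), even though it reads none.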

Typical algorithms will both operate on some input data and generate some output data. Hence, in a naive view of computation, all data is either input data or output data, with algorithms being sequences of computational steps that transform the input data into the desired output data. Although some kinds of simple computation do fit this simple picture, nearly all complex programs make use of data in a way that is highly interconnected and interleaved with algorithmic computations. Within a computation, data may be created to represent many intermediate forms between its input and output (e.g. between the stages of a data processing pipeline), and complex structures, such as hash tables, networks and trees may be built, in order to support complex processing operations. Indeed, many key algorithms of data analysis, AI and ML are based upon special-purpose data structures that are generated and manipulated during their execution.
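
The following sketch shows a tiny processing pipeline in which intermediate data structures are created between the input and the output. The sentence being analysed is made up, and a real text-analysis program would need more careful handling of punctuation and capital letters.

```python
text = "the cat sat on the mat and the dog sat too"    # input data

# Stage 1: an intermediate list of words.
words = text.split()

# Stage 2: an intermediate dictionary (a hash table) of word counts.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Stage 3: an intermediate list of (word, count) pairs, most frequent first,
# with ties broken alphabetically.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

top_two = ranked[:2]                                    # output data
print(top_two)   # [('the', 3), ('sat', 2)]
```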

Question

Can you think of any kinds of algorithm or software system that do not output any form of data?

Types of Computation and Application Involving Data

The central importance of data in computation stems not only from its variety of forms and the diversity of its content but also from the many different ways it can be put to use in algorithms and the many kinds of application it can support.

Here are some of the main types of computation involving data, together with some typical examples:

  • manipulating data to convert it into new forms:

    • converting input data (e.g. data in a file -- perhaps compressed) into a form that is more easily operated on by a program
    • formatting an address, so it can be printed nicely on a label
    • converting a bibliographic database to a web-site
    • compressing text, audio or video, so it can be stored in a smaller file
  • analysing data to extract useful new information:

    • calculating average life expectancy from birth and death dates of a population
    • finding the most powerful earthquake recorded in a dataset of seismic activity
  • using data to enable some useful functionality:

    • finding the quickest route between two places on a map
    • diagnosing an illness

Of course, these types of data-use are not completely separate. Many applications will involve both analysis and manipulation of data, as well as using it to support some functionality.
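
To make the three types of computation more concrete, here is a sketch with one very small example of each. All of the data (the address, the earthquake magnitudes and the road map with travel times in minutes) is invented, and the route finder uses Dijkstra's algorithm, which is one standard technique among several for this task.

```python
import heapq

# 1. Manipulating data: format an address so it can be printed on a label.
address = {"name": "A. Lovelace", "street": "12 Example Road",
           "city": "Leeds", "postcode": "LS0 0AA"}
label = "\n".join([address["name"], address["street"],
                   address["city"] + " " + address["postcode"]])

# 2. Analysing data: find the most powerful earthquake in a dataset.
quakes = [("site A", 5.1), ("site B", 7.3), ("site C", 6.4)]
strongest = max(quakes, key=lambda quake: quake[1])

# 3. Using data to enable functionality: quickest route on a small map.
roads = {"A": {"B": 5, "C": 2}, "B": {"D": 3}, "C": {"B": 1, "D": 7}, "D": {}}

def quickest_time(graph, start, goal):
    """Return the smallest total travel time from start to goal, or None."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        time, place = heapq.heappop(queue)
        if place == goal:
            return time
        if time > best.get(place, float("inf")):
            continue                      # an out-of-date queue entry
        for neighbour, minutes in graph[place].items():
            new_time = time + minutes
            if new_time < best.get(neighbour, float("inf")):
                best[neighbour] = new_time
                heapq.heappush(queue, (new_time, neighbour))
    return None

print(label)
print(strongest)                        # ('site B', 7.3)
print(quickest_time(roads, "A", "D"))   # 6
```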

Application Example: a medical diagnosis system

Let us consider the example of a medical diagnosis system and the ways in which it may interact with and manipulate data:

  • The system might make use of many different forms of input data (medical histories, blood test results, genetic information, blood pressure figures, X-ray images).

  • It may perform a variety of transformations on data items (combining data from different sources, converting between different types of record, calculating derived quantities, adjusting the scale and orientation of images).

  • It will carry out various forms of analysis, such as finding abnormal measurements, or changes over time, detecting unusual image features or identifying correlations between different sets of data values.

  • Based on the results of its data manipulation and analysis, a medical diagnosis system could support several different functions, such as: diagnosing diseases, predicting outcomes of interventions, suggesting treatment, identifying likely causes of illnesses.

Question.

What kinds of data would you expect to be generated internally within a medical diagnosis system?

Possible Answers
  • Derived values such as BMI might be computed from measured quantities.
  • Data objects might be created that grouped together multiple pieces of information relating to each individual patient or course of treatment.
  • Averages and frequencies of measurements over a population of patients might be computed.
  • Possible paths through a treatment process could be generated in the form of trees or networks.
  • Images might be scaled and rotated to standard sizes and orientations (e.g. so that they can be compared to see the progress of a disease).
  • Many other forms of internal data might be created.
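
A couple of these answers can be sketched in code. The Patient class, its field names and all of the figures below are hypothetical and are not taken from any real system or real patients. The sketch shows a data object that groups information about one patient, a derived value (BMI), and an average computed over a small population.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Patient:
    """Groups together several pieces of information about one patient."""
    patient_id: str
    height_m: float
    weight_kg: float
    systolic_readings: list = field(default_factory=list)

    @property
    def bmi(self):
        """Derived value: body mass index, weight divided by height squared."""
        return self.weight_kg / self.height_m ** 2

# Made-up records, standing in for data read from an external source.
patients = [
    Patient("P001", 1.60, 64.0, [118, 122, 120]),
    Patient("P002", 2.00, 100.0, [135, 141]),
]

# Internally generated data: one derived value per patient ...
bmi_by_patient = {p.patient_id: round(p.bmi, 1) for p in patients}

# ... and an average over the population of each patient's mean reading.
average_systolic = mean(mean(p.systolic_readings) for p in patients)

print(bmi_by_patient)
print(average_systolic)
```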

Exercise.

List the types of data that might be used and/or generated, the types of computation that might operate on data and any internal forms of data that might be created within the following kinds of application:

  • A route planning and navigation system (e.g. Google Maps or something similar)
  • Web based software for selling and recommending books (e.g. something like Amazon)
  • A system that simulates the interactions and population dynamics of plant and animal species in a natural environment.

(These are all quite complex kinds of software and there will be a lot of different types of data and operations that could be involved in such systems, so there is no need to make a comprehensive list. But it should be interesting to consider the possibilities.)

The Significance of Data and Data Structures in Computation

We conclude this lesson with some general observations and considerations regarding the significance of data and data structures in computer programming and application development. These will explore some reasons why the fields of data analysis and data science are currently dominant in driving the development of modern computer technology.

Increasing Interest in Data and Data Structures

Traditionally, in the study of programming as an academic topic, emphasis has been placed more on algorithms than on data. Data has been conceived of as a raw material upon which algorithms work in order to produce useful results.

In commercial software design, data has for obvious reasons received more attention, and different mechanisms for storing data (i.e. types of database) have been explored. But commercial computing has also historically regarded data mainly as a resource, with the focus of software design being more on the system architecture by which data interacts with other components of a complex system.

However, in recent years much more attention has been devoted to data analysis and to the development of data structures. Among the likely reasons for this are the following:

  • The development of algorithms and system architectures has reached a level of maturity such that it is difficult to make innovations in these aspects of software. So it is easier to improve software by focussing on data analysis and data structures.

  • Problems of consistently interpreting the meaning of data have emerged in data-driven applications (especially large and/or distributed systems), and avoiding them requires careful analysis of information.

  • Novel algorithms and architectures for data analysis and data-driven functionality are being used, such as those based on Machine Learning, AI Search and Reasoning, and Semantic Web Technologies. These have led to more interest in, and motivation for, working with complex data of many kinds.

  • Very large amounts of data are now readily available (Big Data).

The Key Role of Data Structures in Effective Programming

The importance of data structures in programming has already been stressed to the point where it may have become annoying. But just to rub it in a little more, let us conclude by considering a well-known quote from the famous originator of the Linux operating system, Linus Torvalds:

"I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
(Linus Torvalds, 2006-08-28, Message to Git mailing list)

Can this be true? Surely, good programmers worry about their code: code is what they write. The data structures are defined in the code anyway, so how can you not care about the code? And surely there are very clever programmers who focus their worries on implementing efficient and reliable algorithms but may be working with quite simple kinds of data that do not require complex data structures.

Such counter-views seem to be valid. Clearly, it would be naive to take Torvalds' claim as 100% true for every programmer. Nevertheless, the claim is based on huge experience and significant insight.

As is typical with insights, the reason why good programmers tend to put more emphasis on data structures than on code is difficult to explain clearly and briefly. But the following points give some idea:

  • A well-designed data structure can reduce the complexity of algorithms that operate on the data.

  • Data structures, though abstract, are static, whereas algorithms are dynamic and correspond to potentially infinite sequences of program states. This makes data structures easier to mentally comprehend, so a complex data structure is less likely to give rise to bugs than a complex algorithm.

  • Use of convenient and intuitive data structures makes it much easier for information to be shared and operated on in a consistent way by different components within a program. It also makes it much easier to add to and modify parts of the code without breaking the program.
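
The first of these points can be illustrated with a very small sketch, using a few names and ID numbers from the Pokemon example. The two lookup functions are hypothetical and exist only to show the contrast. With a list of pairs, finding a name requires a search loop; with a dictionary keyed by ID number, the same task needs no explicit algorithm at all.

```python
# Representation 1: a list of (id, name) pairs.
pokemon_list = [(1, "Bulbasaur"), (2, "Ivysaur"), (4, "Charmander")]

def name_from_list(pokemon_id):
    """Search the list pair by pair until the ID is found."""
    for this_id, name in pokemon_list:
        if this_id == pokemon_id:
            return name
    return None

# Representation 2: a dictionary that maps each ID directly to a name.
pokemon_by_id = dict(pokemon_list)

def name_from_dict(pokemon_id):
    """Look the ID up directly; no search loop is needed."""
    return pokemon_by_id.get(pokemon_id)

print(name_from_list(4))   # Charmander
print(name_from_dict(4))   # Charmander
print(name_from_dict(3))   # None
```

Both functions give the same answers, but the second is shorter, leaves less room for mistakes, and stays fast as the amount of data grows.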

Perhaps it is too early in the module for you to fully appreciate how focussing on data structures can help you to be more effective in programming. But you are signed up for Programming for Data Science and hopefully, by the end of the module, you will understand why this is so.
