2.2 Data Compression

Essential Knowledge:

  • Data compression can reduce the number of bits, or size, of transmitted or stored data
  • Fewer bits does not necessarily mean less information
  • The amount of size reduction from compression depends on both the amount of redundancy in the original data representation and the compression algorithm applied
  • Lossless data compression algorithms can usually reduce the number of bits stored or transmitted while guaranteeing complete reconstruction of the original data
  • Lossy data compression algorithms can significantly reduce the number of bits stored or transmitted but only allow reconstruction of an approximation of the original data
  • Lossy data compression algorithms can usually reduce the number of bits stored or transmitted more than lossless compression algorithms
  • In situations where quality or ability to reconstruct the original is maximally important, lossless compression algorithms are typically chosen.
  • In situations where minimizing data size or transmission time is maximally important, lossy compression algorithms are typically chosen.

What is Data Compression?

  • Data compression is a reduction in the number of bits needed to represent data
  • Data compression is used to save transmission time and storage space

How Does Compression Work?

  • When data is compressed, you are looking for repeated patterns and predictability
  • Example: XXXOOXX = X3O2X2
  • The larger the data file, the more patterns that can be pulled out
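The pattern idea above is run-length encoding: each run of a repeated character is replaced by the character plus a count. A minimal sketch in Python (function names are my own, not from the notes):

```python
def rle_encode(text):
    """Run-length encode: replace each run of a character with char + count."""
    if not text:
        return ""
    out = []
    run_char, run_len = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(run_char + str(run_len))
            run_char, run_len = ch, 1
    out.append(run_char + str(run_len))  # flush the final run
    return "".join(out)

def rle_decode(encoded):
    """Reverse the encoding: expand each char + count back into a run."""
    out, i = [], 0
    while i < len(encoded):
        ch = encoded[i]
        i += 1
        digits = ""
        while i < len(encoded) and encoded[i].isdigit():
            digits += encoded[i]
            i += 1
        out.append(ch * int(digits))
    return "".join(out)

print(rle_encode("XXXOOXX"))  # X3O2X2 -- the example from the notes
print(rle_decode("X3O2X2"))   # XXXOOXX -- fully restored, so lossless
```

Because decoding restores the input exactly, this is a lossless scheme. Note that on data with few repeats ("ABAB"), the encoded form can actually be longer, which is why the amount of redundancy matters.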

Text Compression

  • Replace repeated words or phrases with a single character or symbol that stands for them
  • Example: with # standing for "Twinkle", Twinkle Twinkle Little Star = # # Little Star
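This substitution idea can be sketched with a small symbol table. The table contents and the `#` symbol are made-up for illustration; real text compressors build their dictionaries automatically.

```python
def compress_text(text, table):
    """Replace each repeated word in the table with its short symbol."""
    for word, symbol in table.items():
        text = text.replace(word, symbol)
    return text

def decompress_text(text, table):
    """Reverse the substitution so the original text is fully restored."""
    for word, symbol in table.items():
        text = text.replace(symbol, word)
    return text

table = {"Twinkle": "#"}  # hypothetical symbol table
song = "Twinkle Twinkle Little Star"
small = compress_text(song, table)
print(small)                          # "# # Little Star"
print(decompress_text(small, table))  # original restored -> lossless
```

As long as the symbol never appears in the original text, decompression is exact, so this is lossless.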

Data Compression Methods: Lossless Vs. Lossy

  • Lossless: Reduces the number of bits stored or transmitted while guaranteeing complete reconstruction of the original data. Used where losing even small details would change the overall file.
  • Lossy: Significantly reduces the number of bits stored or transmitted but only allows reconstruction of an approximation of the original data. Used where small details can be discarded as redundant.
  • Example: A detailed sky with multiple shades of blue will be averaged out to one shade of blue by the JPEG algorithm.
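The sky example can be sketched as a single lossy step: averaging many nearby shades into one value. This is only an illustration of the idea, not the actual JPEG algorithm (which uses quantization of frequency coefficients); the sample values are hypothetical blue-channel intensities.

```python
def lossy_average(shades):
    """Lossy step: collapse many nearby shades into their single average."""
    avg = round(sum(shades) / len(shades))
    return [avg] * len(shades)

sky = [200, 204, 198, 202, 196]  # hypothetical blue-channel values
print(lossy_average(sky))        # [200, 200, 200, 200, 200]
```

Five different values now need only one stored value plus a count, but the original shades cannot be recovered: only an approximation remains.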

Practice Problem

The text given in the practice problem decodes to: Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall, all the king's horses and all the king's men, Couldn't put Humpty together again.

2.3.1 - Extracting Information from Data

Essential Knowledge

  • The ability to process data depends on the capabilities of the user and their tools
  • Data sets pose challenges regardless of size, such as the need to clean data, incomplete data, invalid data, and the need to combine data sources
  • Depending on how data were collected, they may not be uniform
  • Cleaning data is a process that makes the data uniform without changing their meaning
  • Problems of bias are often created by the type or source of data being collected. Bias is not eliminated by simply collecting more data
  • The size of a data set affects the amount of information that can be extracted from it
  • Large data sets are difficult to process using a single computer and may require parallel systems
  • Scalability of systems is an important consideration when working with data sets, as the computational capacity of a system affects how data sets can be processed and stored.

Where do we start with data?

  • Collecting data
    • Issues to consider:
      • Source
        • Do you need more sources?
      • Tools to analyze data
  • Processing data is affected by size
    • How much information can we get?
    • Can one computer handle our task?
      • May need to use parallel processing
        • Use two or more processors to handle different parts of the task
        • Check the scalability of the system.
  • Is there potential bias?
    • Intentional:
      • Who collected the data?
      • Do they have an agenda?
    • Unintentional:
      • How is the data collected?
      • Who collected the data?
  • Data Cleaning:
    • Identifying incomplete, corrupt, duplicate, or inaccurate records
    • Replacing, modifying, or deleting the "dirty" data
  • Be careful about modifying or deleting!
    • Be sure there is a mistake!
    • Keep records of what data is modified/deleted and WHY
    • Invalid data may need to be modified - keep form consistent
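The cleaning steps above can be sketched as a small routine that makes the data uniform, flags impossible values, and keeps a log of what was changed and why. The record fields and sample rows are hypothetical, echoing the practice problem below.

```python
def clean_records(records):
    """Fix dirty rows and keep a log of every change and WHY it was made."""
    log, cleaned = [], []
    for rec in records:
        rec = dict(rec)  # never modify the original data in place
        if rec.get("gpa") is not None:
            gpa = float(rec["gpa"])
            # invalid data: a GPA above 4.0 is impossible in this scenario
            if gpa > 4.0:
                log.append((rec["id"], "gpa", "above 4.0, removed"))
                rec["gpa"] = None
            else:
                # keep the form consistent: two decimal places, zero-padded
                rec["gpa"] = f"{gpa:.2f}"
        cleaned.append(rec)
    return cleaned, log

# hypothetical rows: one valid GPA, one impossible GPA
rows = [{"id": 1, "gpa": "3.5"}, {"id": 2, "gpa": "4.6"}]
cleaned, log = clean_records(rows)
print(cleaned[0]["gpa"])  # "3.50" -- uniform format, meaning unchanged
print(log)                # records which value was removed and why
```

Keeping the log (rather than silently overwriting) is what makes the cleaning auditable, per the "keep records of what is modified/deleted" rule above.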

Practice Problems:

From the sample data shown in the video, here is the data that needs to be modified:

  • The 10th student ID is missing
  • The 5th and 9th years in school are not numbers like the rest of the column
  • The 2nd and 8th GPAs need to have two decimal places, with zeroes added where needed.
  • The sixth year in school and the fifth GPA are inaccurate. There is no way a high school student is in Year 20, and we can assume that it is highly unlikely that a GPA higher than 4.0 is possible in this scenario

2.3.2 - Extracting Information from Data

Essential Knowledge

  • Information is the collection of facts and patterns extracted from data
  • Data provides opportunities for identifying trends, making connections, and addressing problems
  • Digitally processed data may show correlation between variables. A correlation found in data does not necessarily indicate that a causal relationship exists. Additional research is needed to understand the exact nature of the relationship.
  • Often a single source does not contain the data needed to draw conclusions.
  • Metadata are data about data
  • Changes and deletions made to metadata do not change the primary data
  • Metadata are used for finding, organizing, and managing information.
  • Metadata can increase the effective use of data or data sets by providing additional information
  • Metadata allow data to be structured and organized.

What is metadata?

  • Prefix meta
    • after, beyond, among
    • metadata - data about data
  • Some data has information about itself
    • Author
    • Date
    • Length/Size
  • Why?
    • Identify
    • Track/Organize
    • Process
  • Example: A photo includes:
    • Date
    • Time
    • Location
    • Height
    • Width
    • Pixels
    • Type of compression
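Metadata like this can be modeled as a record kept alongside, but separate from, the pixel data, so it can be used to find and organize photos without touching the primary data. The field names and values here are hypothetical.

```python
# a photo's metadata, stored separately from the pixel data itself
photo_metadata = {
    "date": "2023-05-01",    # hypothetical values
    "width": 4000,
    "height": 3000,
    "compression": "JPEG",
}

def find_by(photos, key, value):
    """Use metadata to find/organize photos without reading any pixels."""
    return [p for p in photos if p.get(key) == value]

matches = find_by([photo_metadata], "compression", "JPEG")
print(len(matches))  # 1

# deleting a metadata field does not change the primary (pixel) data
del photo_metadata["date"]
```

This mirrors the essential knowledge above: metadata increases the effective use of the data, and changing or deleting it leaves the primary data untouched.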

Practice Problems

Metadata in the book mentioned in the video:

  • Author
  • Title
  • Genre
  • Date of Publication
  • Length

How can it be used?

  • Suggest other books
  • Organize inventory

Finding Patterns

  • Data allows us to gain knowledge:
    • Look for trends
    • Look for patterns
    • Answer questions
  • Be Careful!
  • Look out for misleading trends/patterns
  • Correlation does not equal causation
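The correlation caution above can be made concrete with the Pearson correlation coefficient. The data set is a classic made-up example: ice cream sales and drowning incidents both rise in summer, so they correlate strongly, yet neither causes the other (heat drives both).

```python
def pearson_r(xs, ys):
    """Pearson correlation: strength of a linear relationship, -1 to 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# hypothetical monthly data -- perfectly correlated, not causal
ice_cream_sales = [10, 20, 30, 40]
drownings = [1, 2, 3, 4]
r = pearson_r(ice_cream_sales, drownings)
print(r)  # ~1.0: strong correlation, but no causal relationship
```

A correlation of 1.0 here says nothing about causation; as the essential knowledge notes, additional research is needed to understand the exact nature of the relationship.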

What I Took Away

  • Data compression - lossless when the original must be perfectly reconstructable, lossy when small details can be discarded for bigger savings - represents data with fewer bits to save storage space and transmission time.
  • Extracting information from data requires programmers to go beyond the data set, using tools such as metadata and additional resources.