2.2 Data Compression

Essential Knowledge:

  • Data compression can reduce the number of bits, or size, of transmitted or stored data
  • Fewer bits does not necessarily mean less information
  • The amount of size reduction from compression depends on both the amount of redundancy in the original data representation and the compression algorithm applied
  • Lossless data compression algorithms can usually reduce the number of bits stored or transmitted while guaranteeing complete reconstruction of the original data
  • Lossy data compression algorithms can significantly reduce the number of bits stored or transmitted but only allow reconstruction of an approximation of the original data
  • Lossy data compression algorithms can usually reduce the number of bits stored or transmitted more than lossless compression algorithms
  • In situations where quality or ability to reconstruct the original is maximally important, lossless compression algorithms are typically chosen.
  • In situations where minimizing data size or transmission time is maximally important, lossy compression algorithms are typically chosen.

What is Data Compression?

  • Data compression is a reduction in the number of bits needed to represent data
  • Data compression is used to save transmission time and storage space

How Does Compression Work?

  • When data is compressed, you are looking for repeated patterns and predictability
  • Example: XXXOOXX = X3O2X2
  • The larger the data file, the more patterns that can be pulled out
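The pattern idea above is run-length encoding: each run of a repeated character is replaced by the character plus a count. A minimal sketch in Python (function names are my own, not from the notes):

```python
def rle_encode(text):
    """Run-length encode: replace each run of a character with char + count."""
    if not text:
        return ""
    out = []
    run_char, run_len = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(run_char + str(run_len))
            run_char, run_len = ch, 1
    out.append(run_char + str(run_len))  # flush the final run
    return "".join(out)

def rle_decode(encoded):
    """Reverse the encoding: expand each char + count back into a run."""
    out, i = [], 0
    while i < len(encoded):
        ch = encoded[i]
        i += 1
        digits = ""
        while i < len(encoded) and encoded[i].isdigit():
            digits += encoded[i]
            i += 1
        out.append(ch * int(digits))
    return "".join(out)

print(rle_encode("XXXOOXX"))  # X3O2X2 -- the example from the notes
print(rle_decode("X3O2X2"))   # XXXOOXX -- fully restored, so lossless
```

Because decoding restores the input exactly, this is a lossless scheme. Note that on data with few repeats ("ABAB"), the encoded form can actually be longer, which is why the amount of redundancy matters.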

Text Compression

  • Replace repeated words or phrases with a single character or symbol that stands for them
  • Example: with # standing for "Twinkle", Twinkle Twinkle Little Star = # # Little Star
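This substitution idea can be sketched with a small symbol table. The table contents and the `#` symbol are made-up for illustration; real text compressors build their dictionaries automatically.

```python
def compress_text(text, table):
    """Replace each repeated word in the table with its short symbol."""
    for word, symbol in table.items():
        text = text.replace(word, symbol)
    return text

def decompress_text(text, table):
    """Reverse the substitution so the original text is fully restored."""
    for word, symbol in table.items():
        text = text.replace(symbol, word)
    return text

table = {"Twinkle": "#"}  # hypothetical symbol table
song = "Twinkle Twinkle Little Star"
small = compress_text(song, table)
print(small)                          # "# # Little Star"
print(decompress_text(small, table))  # original restored -> lossless
```

As long as the symbol never appears in the original text, decompression is exact, so this is lossless.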

Data Compression Methods: Lossless Vs. Lossy

  • Lossless: Reduces the number of bits stored or transmitted while guaranteeing complete reconstruction of the original data. Used where losing even small details would change the overall file.
  • Lossy: Significantly reduces the number of bits stored or transmitted but only allows reconstruction of an approximation of the original data. Used where small details can be discarded as redundant.
  • Example: A detailed sky with multiple shades of blue will be averaged out to one shade of blue by the JPEG algorithm.
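The sky example can be sketched as a single lossy step: averaging many nearby shades into one value. This is only an illustration of the idea, not the actual JPEG algorithm (which uses quantization of frequency coefficients); the sample values are hypothetical blue-channel intensities.

```python
def lossy_average(shades):
    """Lossy step: collapse many nearby shades into their single average."""
    avg = round(sum(shades) / len(shades))
    return [avg] * len(shades)

sky = [200, 204, 198, 202, 196]  # hypothetical blue-channel values
print(lossy_average(sky))        # [200, 200, 200, 200, 200]
```

Five different values now need only one stored value plus a count, but the original shades cannot be recovered: only an approximation remains.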

Practice Problem

The text given in the practice problem decodes to: Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall, all the king's horses and all the king's men, Couldn't put Humpty together again.

2.3.1 - Extracting Information from Data

Essential Knowledge

  • The ability to process data depends on the capabilities of the user and their tools
  • Data sets pose challenges regardless of size, such as the need to clean data, incomplete data, invalid data, and the need to combine data sources
  • Depending on how data were collected, they may not be uniform
  • Cleaning data is a process that makes the data uniform without changing their meaning
  • Problems of bias are often created by the type or source of data being collected. Bias is not eliminated by simply collecting more data
  • The size of a data set affects the amount of information that can be extracted from it
  • Large data sets are difficult to process using a single computer and may require parallel systems
  • Scalability of systems is an important consideration when working with data sets, as the computational capacity of a system affects how data sets can be processed and stored.

Where do we start with data?

  • Collecting data
    • Issues to consider:
      • Source
        • Do you need more sources?
      • Tools to analyze data
  • Processing data is affected by size
    • How much information can we get?
    • Can one computer handle our task?
      • May need to use parallel processing
        • Use two or more processors to handle different parts of the task
        • Check the scalability of the system.
  • Is there potential bias?
    • Intentional:
      • Who collected the data?
      • Do they have an agenda?
    • Unintentional:
      • How is the data collected?
      • Who collected the data?
  • Data Cleaning:
    • Identifying incomplete, corrupt, duplicate, or inaccurate records
    • Replacing, modifying, or deleting the "dirty" data
  • Be careful about modifying or deleting!
    • Be sure there is a mistake!
    • Keep records of what data is modified/deleted and WHY
    • Invalid data may need to be modified - keep form consistent
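The cleaning steps above can be sketched as a small routine that makes the data uniform, flags impossible values, and keeps a log of what was changed and why. The record fields and sample rows are hypothetical, echoing the practice problem below.

```python
def clean_records(records):
    """Fix dirty rows and keep a log of every change and WHY it was made."""
    log, cleaned = [], []
    for rec in records:
        rec = dict(rec)  # never modify the original data in place
        if rec.get("gpa") is not None:
            gpa = float(rec["gpa"])
            # invalid data: a GPA above 4.0 is impossible in this scenario
            if gpa > 4.0:
                log.append((rec["id"], "gpa", "above 4.0, removed"))
                rec["gpa"] = None
            else:
                # keep the form consistent: two decimal places, zero-padded
                rec["gpa"] = f"{gpa:.2f}"
        cleaned.append(rec)
    return cleaned, log

# hypothetical rows: one valid GPA, one impossible GPA
rows = [{"id": 1, "gpa": "3.5"}, {"id": 2, "gpa": "4.6"}]
cleaned, log = clean_records(rows)
print(cleaned[0]["gpa"])  # "3.50" -- uniform format, meaning unchanged
print(log)                # records which value was removed and why
```

Keeping the log (rather than silently overwriting) is what makes the cleaning auditable, per the "keep records of what is modified/deleted" rule above.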

Practice Problems:

From the sample data shown in the video, here is the data that needs to be modified:

  • The 10th student ID is missing
  • The 5th and 9th years in school are not numbers like the rest of the column
  • The 2nd and 8th GPAs need to have two decimal places, with zeroes added where needed.
  • The sixth year in school and the fifth GPA are inaccurate. There is no way a high school student is in Year 20, and we can assume that it is highly unlikely that a GPA higher than 4.0 is possible in this scenario

2.3.2 - Extracting Information from Data

Essential Knowledge

  • Information is the collection of facts and patterns extracted from data
  • Data provides opportunities for identifying trends, making connections, and addressing problems
  • Digitally processed data may show correlation between variables. A correlation found in data does not necessarily indicate that a causal relationship exists. Additional research is needed to understand the exact nature of the relationship.
  • Often a single source does not contain the data needed to draw conclusions.
  • Metadata are data about data
  • Changes and deletions made to metadata do not change the primary data
  • Metadata are used for finding, organizing, and managing information.
  • Metadata can increase the effective use of data or data sets by providing additional information
  • Metadata allow data to be structured and organized.

What is metadata?

  • Prefix meta
    • after, beyond, among
    • metadata - data about data
  • Some data has information about itself
    • Author
    • Date
    • Length/Size
  • Why?
    • Identify
    • Track/Organize
    • Process
  • Example: A photo includes:
    • Date
    • Time
    • Location
    • Height
    • Width
    • Pixels
    • Type of compression
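Metadata like this can be modeled as a record kept alongside, but separate from, the pixel data, so it can be used to find and organize photos without touching the primary data. The field names and values here are hypothetical.

```python
# a photo's metadata, stored separately from the pixel data itself
photo_metadata = {
    "date": "2023-05-01",    # hypothetical values
    "width": 4000,
    "height": 3000,
    "compression": "JPEG",
}

def find_by(photos, key, value):
    """Use metadata to find/organize photos without reading any pixels."""
    return [p for p in photos if p.get(key) == value]

matches = find_by([photo_metadata], "compression", "JPEG")
print(len(matches))  # 1

# deleting a metadata field does not change the primary (pixel) data
del photo_metadata["date"]
```

This mirrors the essential knowledge above: metadata increases the effective use of the data, and changing or deleting it leaves the primary data untouched.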

Practice Problems

Metadata in the book mentioned in the video:

  • Author
  • Title
  • Genre
  • Date of Publication
  • Length

How can it be used?

  • Suggest other books
  • Organize inventory

Finding Patterns

  • Data allows us to gain knowledge:
    • Look for trends
    • Look for patterns
    • Answer questions
  • Be Careful!
  • Look out for misleading trends/patterns
  • Correlation does not equal causation
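The correlation caution above can be made concrete with the Pearson correlation coefficient. The data set is a classic made-up example: ice cream sales and drowning incidents both rise in summer, so they correlate strongly, yet neither causes the other (heat drives both).

```python
def pearson_r(xs, ys):
    """Pearson correlation: strength of a linear relationship, -1 to 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# hypothetical monthly data -- perfectly correlated, not causal
ice_cream_sales = [10, 20, 30, 40]
drownings = [1, 2, 3, 4]
r = pearson_r(ice_cream_sales, drownings)
print(r)  # ~1.0: strong correlation, but no causal relationship
```

A correlation of 1.0 here says nothing about causation; as the essential knowledge notes, additional research is needed to understand the exact nature of the relationship.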

What I Took Away

  • Data compression - lossless when the original must be perfectly reconstructable, lossy when small details can be discarded for bigger savings - represents data with fewer bits to save storage space and transmission time.
  • Extracting information from data requires programmers to go beyond the data set, using tools such as metadata and additional resources.