Attribution: Version Control with Git, 2nd Edition [Book] (oreilly.com)

A Git repository is a database containing all of the information needed to retain and manage the revisions and history of a project. Within a repository, Git manages two primary data structures: the object store and the index. All of this repository data is stored at the root of your project in the .git directory.

The object store is designed to be efficiently copied during a clone. The index is private to a repository.

Object store


There are four types of objects in the object store: blobs, trees, commits, and tags. These are the four atomic data types that form all of Git's higher-level data structures.

Each version of a file is represented by a blob (which is a contraction of binary large object). Blob is a common term in computing that refers to a variable or file that can contain any data and whose internal structure is ignored by the program. A blob holds a file's data but not any metadata about it such as its name.

A tree object represents one level of directory information. It records blob identifiers, path names, and metadata for all of the files in one directory. It can recursively reference other tree objects.

A commit object holds metadata for each change introduced into the repository, such as the author, commit date, and log message. Each commit points to a tree object that captures one complete snapshot of the repository at the time the commit was performed. The initial commit, or root commit, has no parent. Most commits have one parent commit, but a commit can have more than one parent.

A tag object assigns an arbitrary name to an object, usually a commit.

To use disk space and network bandwidth efficiently, Git also compresses objects and stores them in pack files, which are kept in the object store as well.

Index


The index is a temporary and dynamic binary file that describes the directory structure of the entire repository at one moment in time. Git commands let you stage changes in the index, and the index also plays an important role in merges.

Content


Every object in the object store has a unique name, generated by applying SHA1 to the contents of the object. SHA1 values are 160-bit values and are usually represented by 40-digit hexadecimal numbers. The SHA1 is sometimes called a hash code or object ID. Because even a tiny change to a file changes its SHA1 hash, and two different contents are extremely unlikely to produce the same hash, the SHA1 hash is an effective globally unique identifier.
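The naming scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming a repository that uses Git's default SHA1 object format; the function name git_object_id is made up for this example. Note that Git hashes a short header (the object type and the content length) followed by the content, so the result differs from a plain SHA1 of the file.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the 40-digit hexadecimal ID for an object of the given type.

    Hypothetical helper: Git prepends "<type> <size>" and a NUL byte to the
    content before hashing.
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


original = git_object_id("blob", b"hello world\n")
changed = git_object_id("blob", b"hello world!\n")

print(original)             # 40 hexadecimal digits = 160 bits
print(original == changed)  # a one-character change gives a different ID
```

If Git is installed, the first value should match what `git hash-object` reports for a file containing the same bytes.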

Git tracks content; the object store is based on hashes of the contents of its objects, not on their file names or paths. 
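A toy content-addressed store shows what tracking content rather than file names means in practice. This is only a sketch: ToyObjectStore and put_blob are hypothetical names, and a Python dictionary stands in for the files under .git. Two files with identical content yield the same ID, so the content is stored once.

```python
import hashlib


class ToyObjectStore:
    """Hypothetical in-memory stand-in for Git's object store."""

    def __init__(self):
        self.objects = {}  # object ID -> content

    def put_blob(self, content: bytes) -> str:
        # The ID depends only on the content, never on a file name or path.
        header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
        object_id = hashlib.sha1(header + content).hexdigest()
        self.objects[object_id] = content
        return object_id


store = ToyObjectStore()
readme_id = store.put_blob(b"same text\n")    # imagine this is README
copy_id = store.put_blob(b"same text\n")      # imagine this is docs/README.copy
other_id = store.put_blob(b"different text\n")

print(readme_id == copy_id)  # identical content, identical ID
print(len(store.objects))    # only two distinct objects are stored
```

The file names live in tree objects, which is why a rename alone does not create a new blob.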

Pack files


Git uses pack files as an efficient storage mechanism. When building a pack file, Git looks for files with similar content, stores one of them in full, and stores the others as differences against it rather than as complete copies.
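The idea of storing a difference instead of a full copy can be illustrated with Python's difflib. This is not Git's pack format, which uses its own binary delta encoding plus compression; make_delta and apply_delta are hypothetical helpers that only show the principle of rebuilding one version from another with "copy these bytes from the base" and "insert these new bytes" instructions.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-new-bytes instructions."""
    delta = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are just not copied
    return delta


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target version from the base version and a delta."""
    out = bytearray()
    for instruction in delta:
        if instruction[0] == "copy":
            _, start, end = instruction
            out += base[start:end]
        else:
            out += instruction[1]
    return bytes(out)


base = b"".join(b"line %d\n" % i for i in range(50))
target = base.replace(b"line 25\n", b"line twenty-five\n")

delta = make_delta(base, target)
new_bytes = sum(len(ins[1]) for ins in delta if ins[0] == "insert")

print(apply_delta(base, delta) == target)  # the target is fully recoverable
print(new_bytes, "new bytes stored instead of", len(target))
```

Only the bytes that actually differ need to be stored alongside the base version, which is where the space saving comes from.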

How the objects fit together


Blobs are at the bottom of the data structure. They reference nothing and are referenced by tree objects. Tree objects point to blobs and possibly to other trees. Any given tree can be pointed to by many commits: commits made at different times and by different contributors can result in the same content in the repository, and that content is represented by the same tree with the same SHA1 value. Each commit points to the single tree that records the snapshot the commit introduces.
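The relationships can be sketched by computing IDs for a blob, a tree, and two commits. This is a simplified illustration, assuming SHA1 object IDs and a single directory level containing only regular files (mode 100644); the helper names, authors, and timestamps are all made up. It shows that two different commits can point to the same tree when the content is identical.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    # A blob holds file data only: no name, no path.
    return object_id("blob", content)


def tree_id(entries: dict) -> str:
    """entries maps a file name to its blob ID (files only, one level)."""
    body = b""
    for name in sorted(entries, key=lambda n: n.encode("utf-8")):
        # Each entry: mode, name, NUL byte, then the raw 20-byte blob ID.
        body += b"100644 " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(entries[name])
    return object_id("tree", body)


def commit_id(tree: str, parents: list, author: str, message: str) -> str:
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # a root commit has none
    lines += [f"author {author}", f"committer {author}", "", message]
    return object_id("commit", "\n".join(lines).encode("utf-8") + b"\n")


readme = blob_id(b"hello world\n")
snapshot = tree_id({"README": readme})

first = commit_id(snapshot, [], "Alice <alice@example.com> 1700000000 +0000",
                  "Add README")
second = commit_id(snapshot, [first],
                   "Bob <bob@example.com> 1700000100 +0000",
                   "Same content, later commit")

print(first != second)  # different commits...
print(snapshot)         # ...both pointing at this one tree
```

The commit IDs differ because the author, date, parent, and message differ, while the tree ID depends only on the names and content it records.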

Article notes

Where is all of the Git repository data stored for a project (in the filesystem)?
The term blob is a contraction of what?
What are the two main data structures that Git stores in the .git directory at the root of your project?
Of the two primary data structures managed by Git, which is designed to be efficiently copied during a clone?
Of the two primary data structures managed by Git, which is private to a repository?
What are the four types of objects in the Git object store?
What are the four atomic data types that form all of Git's higher level data structures?
Git represents every version of each file as what type of object?
What is a common term in computing that refers to a variable or file that can contain any data and whose internal structure is ignored?
What does a Git blob represent?
Does a Git blob representing a file have the file's name?
What does a Git tree object represent?
What does a Git commit point to that represents the snapshot of the repository at the time the commit was performed?
What does a Git commit represent?
What does a Git tag represent?
What Git object represents one level of directory information?
Since a Git tree object represents a directory which can contain files and directories, what two types of Git objects does a Git tree reference?
What Git object holds metadata for each change introduced into the repository like author, commit date, and log message?
Each Git commit points to what type of object that captures a complete snapshot of the repo at that commit?
What is the Git commit that has no parent?
Can a Git commit have more than one parent?
What is the Git object type that assigns an arbitrary name to another Git object?
What type of files does Git use to save disk space by storing differences between similar files instead of every version of every file?
What is the temporary and dynamic binary file used by Git that describes the entire structure of the repo, is where changes are staged, and plays an important role in merges?
What are the two primary data structures managed by Git within a Git repository?
How many bits is a SHA1 value?
SHA1 values are usually represented by hexadecimal numbers with how many digits?
Why is the SHA1 hash used by Git essentially a globally unique identifier?
What data type is at the bottom of the Git data structure and its objects do not reference anything?
What Git data types do tree objects point to?
How is it that the same Git tree can be pointed to by different commits?