Attribution: Version Control with Git, 2nd Edition [Book] (oreilly.com)
A Git repository is a database containing all of the information needed to retain and manage the revisions and history of a project. Within a repository, Git manages two primary data structures: the object store and the index. All of this repository data is stored at the root of your project in the .git directory.
The object store is designed to be efficiently copied during a clone. The index is private to a repository.
Object store
There are four types of objects in the object store: blobs, trees, commits, and tags. These are the four atomic data types that form all of Git's higher level data structures.
Each version of a file is represented by a blob (which is a contraction of binary large object). Blob is a common term in computing that refers to a variable or file that can contain any data and whose internal structure is ignored by the program. A blob holds a file's data but not any metadata about it such as its name.
A tree object represents one level of directory information. It records blob identifiers, path names, and metadata for all of the files in one directory. It can recursively reference other tree objects.
A commit object holds metadata for each change introduced into the repository like author, commit date, and log message. Each commit points to a tree object that captures one complete snapshot of the repository at the time the commit was performed. The initial commit, or root commit, has no parent. Most commits have one commit parent, but it is possible for a commit to have more than one parent.
A tag object assigns an arbitrary name to an object, usually a commit.
To use disk space and network bandwidth efficiently, Git also compresses and stores the objects in pack files which are also in the object store.
Index
The index is a temporary and dynamic binary file that describes the directory structure of the entire repository at one moment in time. Git commands let you stage changes in the index, and the index also plays an important role in merges.
Content
Every object in the object store has a unique name, generated by applying SHA1 to the contents of the object. SHA1 values are 160-bit values, usually written as 40-digit hexadecimal numbers. The SHA1 is sometimes called a hash code or object ID. Because even a tiny change to a file changes its SHA1 hash, the hash serves as an effective globally unique identifier.
Git tracks content; the object store is based on hashes of the contents of its objects, not on their file names or paths.
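As a small illustration of content-based naming, the sketch below computes the ID Git would give a blob. It relies on the fact that Git hashes a short header ("blob", the size in bytes, and a NUL byte) followed by the file's bytes; the function name git_blob_id is a hypothetical helper, not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA1 object ID for a blob holding these bytes."""
    # The hashed data is the header "blob <size>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Identical bytes always produce the same ID, whatever the file is called.
id_a = git_blob_id(b"hello world\n")
id_b = git_blob_id(b"hello world\n")
# A one-character change produces a completely different ID.
id_c = git_blob_id(b"hello world!\n")

print(id_a)
print(id_a == id_b, id_a == id_c)
```

The file name never enters the hash, which is why two files with the same content in different directories share a single blob.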
Pack files
Git uses pack files as an efficient storage mechanism. When packing, Git computes the differences between similar files and stores those differences rather than a complete copy of every version.
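The idea of storing differences can be sketched as a toy delta made of "copy from the base" and "insert new bytes" instructions. This is only an illustration in the same spirit: Git's real pack format is a compact binary encoding with its own rules for choosing a base object, and the names make_delta and apply_delta are hypothetical.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-new-bytes instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse bytes already in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # store only the new bytes
        # "delete" needs no instruction: those base bytes are just not copied
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base plus the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

v1 = b"line one\nline two\nline three\n" * 20
v2 = v1 + b"line four\n"

delta = make_delta(v1, v2)
new_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(v2), new_bytes)  # full size of v2 versus bytes the delta must store
```

Because the second version shares almost everything with the first, the delta only needs to carry the bytes that are actually new.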
How the objects fit together
Blobs are at the bottom of the data structure. They reference nothing and are referenced by tree objects. Tree objects point to blobs and possibly to other trees. Any given tree can be pointed to by many commits: commits made at different times and by different contributors can result in the same content in the repo, and that content is represented by the same tree with the same SHA1 value. Each commit points to the tree that captures the snapshot it introduces.
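The relationships described above can be modeled with a toy in-memory object store. This is a simplified sketch, not Git's on-disk format: real trees use a binary layout with file modes, and real commits also record author and date. The names store, put, put_tree, and put_commit, and the example file names, are all hypothetical.

```python
import hashlib

store = {}  # object ID -> (type, payload); a stand-in for the object store

def put(obj_type: str, payload: bytes) -> str:
    """Store an object under the SHA1 of its type, size, and payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    oid = hashlib.sha1(header + payload).hexdigest()
    store[oid] = (obj_type, payload)
    return oid

def put_tree(entries: dict) -> str:
    """entries maps a name to the ID of a blob or of another tree."""
    # Simplified text encoding, one "<type> <id> <name>" line per entry.
    lines = [f"{store[oid][0]} {oid} {name}" for name, oid in sorted(entries.items())]
    return put("tree", "\n".join(lines).encode())

def put_commit(tree: str, parents: list, message: str) -> str:
    """A commit names one tree, zero or more parents, and a log message."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents] + ["", message]
    return put("commit", "\n".join(lines).encode())

readme = put("blob", b"hello\n")
main_py = put("blob", b"print('hi')\n")
src = put_tree({"main.py": main_py})             # a tree pointing to a blob
root = put_tree({"README": readme, "src": src})  # a tree pointing to a blob and a tree

first = put_commit(root, [], "initial commit")           # root commit: no parent
second = put_commit(root, [first], "no content change")  # same tree, new commit

print(len(store))  # two blobs, two trees, two commits
```

Both commits point to the same root tree because the content did not change between them, so the store holds only one copy of that tree and of everything beneath it.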
Article notes
Where is all of the Git repository data stored for a project (in the filesystem)?
At the root of your project in the .git directory
The term blob is a contraction of what?
Binary large object
What are the two main data structures that Git stores in the .git directory at the root of your project?
The object store and the index
Of the two primary data structures managed by Git, which is designed to be efficiently copied during a clone?
The object store
Of the two primary data structures managed by Git, which is private to a repository?
The index
What are the four types of objects in the Git object store?
Blobs, trees, commits, and tags
What are the four atomic data types that form all of Git's higher level data structures?
Blobs, trees, commits, and tags
Git represents every version of each file as what type of object?
Blob
What is a common term in computing that refers to a variable or file that can contain any data and whose internal structure is ignored?
Blob
What does a Git blob represent?
The contents of one version of a file
Does a Git blob representing a file have the file's name?
No, the blob has no metadata about the file, only the file's data
What does a Git tree object represent?
A directory
What does a Git commit point to that represents the snapshot of the repository at the time the commit was performed?
A tree
What does a Git commit represent?
Metadata about a change introduced into the repository (author, date, log message)
What does a Git tag represent?
An arbitrary name assigned to a different Git object
What Git object represents one level of directory information?
A tree
Since a Git tree object represents a directory which can contain files and directories, what two types of Git objects does a Git tree reference?
Trees and blobs
What Git object holds metadata for each change introduced into the repository like author, commit date, and log message?
Commit
Each Git commit points to what type of object that captures a complete snapshot of the repo at that commit?
A tree
What is the Git commit that has no parent?
The initial/root commit
Can a Git commit have more than one parent?
Yes
What is the Git object type that assigns an arbitrary name to another Git object?
A tag
What type of files does Git use to save disk space by storing the differences between similar files instead of every version of every file?
Pack files
What is the temporary and dynamic binary file used by Git that describes the entire structure of the repo, is where changes are staged, and plays an important role in merges?
The index
What are the two primary data structures managed by Git within a Git repository?
The object store and the index
How many bits is a SHA1 value?
160 bits
SHA1 values are usually represented by hexadecimal numbers with how many digits?
40
Why is the SHA1 hash used by Git essentially a globally unique identifier?
Even the tiniest change in the file content will cause its value to change
What data type is at the bottom of the Git data structure and its objects do not reference anything?
Blobs
What Git data types do tree objects point to?
Trees and blobs
How is it that the same Git tree can be pointed to by different commits?
The content of the repo can be the same for multiple commits