Billions of unnecessary files in GitHub

Gabor Szabo - Dec 21 '22 - Dev Community

As I was looking for easy assignments for the Open Source Development Course, I found something very troubling, which is also an opportunity for a lot of teaching and a lot of practice.

Some files don't need to be in git

Common sense dictates that we rarely need to include generated files in our git repository. There is no point in keeping them in version control as they can be generated again. (The exception might be if the generation takes a lot of time or can be done only during certain phases of the moon.)

Neither is there a need to store 3rd party libraries in our git repository. Instead, we store a list of our dependencies with their required versions, and then we download and install them. (Well, the rightfully paranoid might download and save a copy of every 3rd party library they use to ensure it can never disappear, but you'll see we are not talking about that.)

.gitignore

The way to make sure that neither we nor anyone else adds these files to the git repository by mistake is to create a file called .gitignore, include patterns that match the files we would like to exclude from git, and add the .gitignore file to our repository. git will then ignore those files, as long as they are not already tracked. They won't even show up when you run git status.
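To make this concrete, here is a minimal sketch you can run in a scratch directory. The temporary repository and the file name output.txt are only assumptions for the demonstration; nothing here touches a real project.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# A generated file that we do not want under version control.
echo "generated" > output.txt

# Before: git reports output.txt as an untracked file.
git status --short

# Tell git to ignore it, and put .gitignore itself under version control.
echo "output.txt" > .gitignore
git add .gitignore

# After: output.txt is no longer listed, only the staged .gitignore is.
git status --short
```

Comparing the two git status listings should show the generated file disappearing once its name is in .gitignore.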

The format of the .gitignore file is described in the documentation of .gitignore.

In a nutshell:

/output.txt

Ignore the output.txt file in the root of the project.

output.txt

Ignore output.txt anywhere in the project (in the root or in any subdirectory).

*.txt

Ignore all the files with the .txt extension.

venv

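Ignore the venv folder anywhere in the project. (Without a trailing slash the pattern also matches a file named venv; write venv/ to match folders only.)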

There are more. Check the documentation of .gitignore!
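If you would like to try these pattern types yourself, the following sketch uses git check-ignore to ask git whether a given path is ignored. The file names (draft.md, debug.log and so on) are made up for the demonstration, and I use different names for each pattern so that the rules do not overlap.

```shell
# Throwaway repository, so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# One pattern of each kind described above (the names are hypothetical).
cat > .gitignore <<'EOF'
/output.txt
draft.md
*.log
venv
EOF

# Create some files to test the patterns against.
mkdir -p sub/venv/lib
touch output.txt sub/output.txt sub/draft.md sub/debug.log sub/venv/lib/site.py

# git check-ignore -q exits with 0 if the path is ignored and 1 if it is not.
check() {
  if git check-ignore -q "$1"; then
    echo "ignored:     $1"
  else
    echo "not ignored: $1"
  fi
}

check output.txt
check sub/output.txt
check sub/draft.md
check sub/debug.log
check sub/venv/lib/site.py
```

Only sub/output.txt should be reported as not ignored, because the leading slash in /output.txt ties that pattern to the root of the project.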

Not knowing about .gitignore

Apparently, a lot of people using git and GitHub don't know about .gitignore.

The evidence:

Python developers use something called virtualenv to make it easy to use different dependencies in different projects. When they create a virtualenv they usually configure it to install all the 3rd party libraries in a folder called venv. We should not include this folder in git. And yet:

There are 452M hits for this search: venv

In a similar way NodeJS developers install their dependencies in a folder called node_modules. There are 2B responses for this search: node_modules

Finally, if you use the Finder application on macOS and open a folder, it may create a hidden file called .DS_Store that stores Finder's display settings for that folder. This file is really not needed anywhere else. And yet I saw many copies of it on GitHub. Unfortunately, so far I could not figure out how to search for them. The closest I found is this search.
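You can also check your own repository for these names. The sketch below simulates the mistake in a scratch repository, lists the tracked paths that look like venv, node_modules or .DS_Store, and then stops tracking the file without deleting it. The grep expression and the file main.py are my own assumptions for the demonstration, not an official recipe.

```shell
# Throwaway repository, so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Simulate the mistake: .DS_Store gets staged together with real code.
# (-f is used in case a global ignore file already excludes .DS_Store.)
touch .DS_Store main.py
git add -f .DS_Store main.py

# List tracked paths that contain any of the names discussed above.
git ls-files | grep -E '(^|/)(venv|node_modules)/|(^|/)\.DS_Store$'

# Stop tracking the file but keep it on disk, then ignore it from now on.
git rm --cached --quiet .DS_Store
echo ".DS_Store" >> .gitignore

# Only main.py is still tracked.
git ls-files
```

The important detail is that adding a name to .gitignore does not remove files that are already tracked. That is what git rm --cached is for.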

Misunderstanding .gitignore

There are also many people who misunderstand the way .gitignore works. I can understand it, as the wording of the usual explanation is a bit ambiguous. What we usually say is this:

If you'd like to make sure that git will ignore the __pycache__ folder then you need to put it in .gitignore.

A better way would be to say this:

If you'd like to make sure that git will ignore the __pycache__ folder then you need to put its name in the .gitignore file.

Without that, people might end up creating a folder called .gitignore and moving all the __pycache__ folders into this .gitignore folder. You can see it in this search.
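In commands, the intended meaning looks roughly like this. It is a sketch in a scratch repository; the app folder and the name of the .pyc file are made up.

```shell
# Throwaway repository, so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# A __pycache__ folder like the ones Python creates (file name is hypothetical).
mkdir -p app/__pycache__
touch app/__pycache__/main.cpython-311.pyc

# Append the folder's NAME to the .gitignore FILE.
# The trailing slash restricts the pattern to directories.
echo "__pycache__/" >> .gitignore

# .gitignore is a regular file, and the folder stays where Python created it.
test -f .gitignore && echo ".gitignore is a file"
test -d app/__pycache__ && echo "app/__pycache__ was not moved"

# The ignored folder does not appear in the status listing.
git status --short
```

Nothing is moved or renamed: the __pycache__ folder stays in place and git simply stops reporting it.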

Help

Can you suggest other common cases of unnecessary files in git that should be ignored?

Can you help me create the search for .DS_Store in GitHub?

Updates

More based on the comments:

  • .o files are the result of compiling C and C++ code: .o
  • .class files are the result of compiling Java code: .class
  • .pyc files are compiled Python code, usually stored in the __pycache__ folder mentioned earlier: .pyc
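The patterns for the cases in this list could look like the following sketch. The source and object file names are invented for the demonstration; the -v flag of git check-ignore prints which line of .gitignore matched each path.

```shell
# Throwaway repository, so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Patterns for the compiled files listed above.
cat > .gitignore <<'EOF'
*.o
*.class
*.pyc
__pycache__/
EOF

# Hypothetical source files and the build products next to them.
mkdir -p src app/__pycache__
touch src/main.c src/main.o src/Main.class app/__pycache__/app.cpython-311.pyc

# -v prints the .gitignore line that matched each ignored path.
git check-ignore -v src/main.o src/Main.class app/__pycache__/app.cpython-311.pyc

# The source file matches no pattern, so git check-ignore reports nothing for it.
git check-ignore -v src/main.c || echo "src/main.c is not ignored"
```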

How to create a .gitignore file?

A follow-up post:

