Introduction to Git for Data Science


Data science and engineering increasingly overlap as data scientists work on production systems and join R&D teams. We want to make it easier for data scientists without prior engineering experience to understand core engineering best practices.

We are building a guide to the engineering subjects that data science practitioners tell us they think about, such as Git, Docker, cloud infrastructure, and model serving.

Introduction to Git

Git is a version control system designed to keep track of changes made to source code over time.

Without a version control system, collaboration between multiple people working on the same project quickly becomes chaotic. When no one keeps track of their modifications, it is very difficult to combine them into a single source of truth, and resolving the resulting conflicts is close to impossible. Git, and the platforms built on top of it (such as GitHub), provide tools to solve this problem.

Typically, there is a single central repository (referred to as "origin" or "remote"), which individual users clone to their local machines (called "local" or "clone"). Once users have saved relevant work (referred to as "commits") on their computers, they "push" and "merge" it back into the central repository.
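The workflow above can be sketched in Python, which many data scientists already use, by driving the git command line through subprocess. This is a minimal sketch, assuming git is installed and on the PATH. The directory names (origin.git, alice, bob), the file analysis.py, and the demo identity are hypothetical, and a local bare repository stands in for the central server.

```python
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run one git command in `cwd` and return its trimmed standard output."""
    cmd = ["git", "-c", "user.name=Demo User",
           "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def demo_shared_workflow(base_dir):
    """Two users share work through one central ("origin") repository."""
    base = Path(base_dir)
    origin, alice, bob = base / "origin.git", base / "alice", base / "bob"

    # The central repository; "bare" means it has no working files of its own.
    git("init", "--bare", str(origin), cwd=base)

    # Each user clones the central repository to a local directory.
    git("clone", str(origin), str(alice), cwd=base)
    git("clone", str(origin), str(bob), cwd=base)

    # Alice saves her work locally as a commit, then pushes it to origin.
    (alice / "analysis.py").write_text("print('hello from alice')\n")
    git("add", "analysis.py", cwd=alice)
    git("commit", "-m", "Add analysis script", cwd=alice)
    branch = git("rev-parse", "--abbrev-ref", "HEAD", cwd=alice)
    git("push", "origin", branch, cwd=alice)

    # Bob pulls Alice's commit from origin into his own clone.
    git("pull", "origin", branch, cwd=bob)
    return (bob / "analysis.py").read_text()
```

Calling demo_shared_workflow on an empty temporary directory returns the text of analysis.py as seen from the second clone, which shows that the commit travelled through the central repository.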

Difference between Git and GitHub

Git is both the underlying technology for tracking and merging changes in source code and the name of its command-line client (CLI).

GitHub is an online platform built on top of Git that makes it easier to use. It also provides capabilities such as automation, pull requests, and user management. GitLab is a comparable platform, and Sourcetree is a desktop client for Git.

Git for Data Science

In data science we analyze data using models and algorithms. A model is often built by more than one person, which makes it hard to manage simultaneous updates. Git makes this easier by storing previous versions and allowing many people to work on the same project at the same time.

Let's look at some Git terms that are very common among developers.

Terms

  • Repository − A "database" containing all of a project's branches and commits.

  • Branch − An alternative state or line of development within a repository.

  • Merge − Combining two (or more) branches into one branch, a single source of truth.

  • Clone − Making a local copy of a remote repository.

  • Origin − The remote repository from which the local clone was made.

  • Main/Master − Common names for the root branch, which serves as the main source of truth.

  • Stage − Choosing which files to include in the next commit.

  • Commit − A stored snapshot of the staged changes made to the file(s) in the repository.

  • HEAD − A reference to the current commit in your local repository.

  • Push − Sending your changes to a remote repository so that others can see them.

  • Pull − Bringing other people's changes into your local repository.

  • Pull Request − A mechanism for reviewing and approving your changes before they are merged into main/master.

To carry out the operations described above, we need a handful of commonly used commands. Let's discuss them below −

  • git init − Create a new repository on your local computer.

  • git clone − Copy an existing remote repository to your computer so you can start editing it.

  • git add − Select the file or files to save (staging).

  • git status − Show the files you have modified.

  • git commit − Store a snapshot (commit) of the staged file(s).

  • git push − Send your saved snapshots (commits) to the remote repository.

  • git pull − Bring recent commits made by others into your local repository.

  • git branch − Create or remove branches.

  • git checkout − Switch branches or discard local modifications to file(s).

  • git merge − Merge branches into a single branch, a single source of truth.
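Several of the commands listed above can be tried together in a throwaway repository. The following is a minimal sketch in Python that calls the git command line through subprocess, assuming git is installed and on the PATH. The file name notes.txt, the branch name feature, the commit messages, and the demo identity are hypothetical.

```python
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run one git command in `cwd` and return its trimmed standard output."""
    cmd = ["git", "-c", "user.name=Demo User",
           "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def demo_local_workflow(repo_dir):
    """init, add, status, commit, checkout and merge in one local repository."""
    repo = Path(repo_dir)
    git("init", cwd=repo)                                # create the repository

    (repo / "notes.txt").write_text("baseline model\n")
    print(git("status", "--short", cwd=repo))            # lists the new file
    git("add", "notes.txt", cwd=repo)                    # stage it
    git("commit", "-m", "Add baseline notes", cwd=repo)  # store a snapshot

    # The root branch is called "main" or "master" depending on configuration.
    root_branch = git("rev-parse", "--abbrev-ref", "HEAD", cwd=repo)

    git("checkout", "-b", "feature", cwd=repo)           # create and switch branch
    with open(repo / "notes.txt", "a") as handle:
        handle.write("try gradient boosting\n")
    git("add", "notes.txt", cwd=repo)
    git("commit", "-m", "Note gradient boosting idea", cwd=repo)

    git("checkout", root_branch, cwd=repo)               # back to the root branch
    git("merge", "feature", cwd=repo)                    # bring the feature work in

    # Commit subjects on the root branch after the merge, newest first.
    return git("log", "--format=%s", cwd=repo).splitlines()
```

The function returns the commit subjects on the root branch after the merge, so both the baseline commit and the feature commit should appear in the result.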

Rules for a Smooth Git Process

Git has many advantages in real projects, but in some cases users need to proceed carefully, for security or other reasons.

Here are some rules for keeping the process of uploading a project to GitHub smooth.

Don't push datasets

Git is used to track, manage, and store code, but putting datasets in it is not good practice. To keep track of data, there are many good data-tracking tools available.
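One way to catch datasets before they are staged is a small check that lists files that look like data. This is only a sketch: the suffix list and the size threshold are hypothetical values chosen for illustration, not anything Git defines.

```python
from pathlib import Path

# Hypothetical settings for illustration; adjust them to your project.
DATA_SUFFIXES = {".csv", ".parquet", ".h5", ".pkl"}
MAX_BYTES = 5 * 1024 * 1024  # flag any file larger than 5 MiB


def find_probable_datasets(root, max_bytes=MAX_BYTES):
    """Return relative paths under `root` that look like data rather than code."""
    root = Path(root)
    flagged = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if ".git" in relative.parts or not path.is_file():
            continue  # skip Git's own storage and directories
        if path.suffix.lower() in DATA_SUFFIXES or path.stat().st_size > max_bytes:
            flagged.append(relative.as_posix())
    return flagged
```

Running find_probable_datasets(".") in a project directory before git add gives a list of files to leave out of the commit.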

Don't push secrets

Code pushed to GitHub may be private, and users can also make it public. But even if the repository is private, it is not recommended to store secrets such as passwords there, for security reasons.
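A simple scan can catch the most obvious hard-coded secrets before a commit. The sketch below is an illustration under stated assumptions: the keyword list and the pattern are hypothetical and will miss many real cases, so it is not a substitute for a dedicated secret scanner.

```python
import re

# Hypothetical pattern: a secret-like name assigned a quoted literal value.
SECRET_PATTERN = re.compile(
    r"""(?i)(password|passwd|secret|api_key|token)\w*\s*[:=]\s*["'][^"']+["']"""
)


def find_hardcoded_secrets(text):
    """Return the 1-based numbers of lines that look like hard-coded secrets."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]
```

Reading the value from an environment variable, for example with os.environ, keeps the secret out of the repository, and a line such as password = os.environ["DB_PASSWORD"] is not flagged by this pattern.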

Don't use --force

The --force option is used in various situations, but it is generally not recommended. When a push is rejected, for example because the remote repository contains commits that you do not have locally, it can be tempting to force the push so that your version ends up on the server. Doing so overwrites the history on the remote and can discard other people's commits, so it is not a good approach.

Do small commits with clear descriptions

Beginner developers may not be used to making small commits, but small commits are recommended because they make the development process much clearer and help with future updates. Writing a good, clear description makes that process easier still.
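What counts as a clear description varies between teams, but the idea can be sketched as a small check on the commit message. The 50-character subject limit and the list of vague subjects below are assumptions chosen for illustration, not rules enforced by Git.

```python
# Hypothetical examples of subjects that say nothing about the change.
VAGUE_SUBJECTS = {"update", "fix", "changes", "wip", "stuff"}


def check_commit_message(message, max_subject_len=50):
    """Return a list of problems found in a commit message (empty if none)."""
    problems = []
    lines = message.splitlines()
    subject = lines[0].strip() if lines else ""

    if not subject:
        problems.append("subject line is empty")
    elif len(subject) > max_subject_len:
        problems.append(f"subject is longer than {max_subject_len} characters")

    if subject.lower() in VAGUE_SUBJECTS:
        problems.append("subject is too vague")

    # A blank line conventionally separates the subject from the body.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank before the body")

    return problems
```

For example, check_commit_message("Fix off-by-one in train/test split") returns an empty list, while check_commit_message("update") reports that the subject is too vague.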

Conclusion

Git is a version control system designed to keep track of changes made to source code over time. Without a version control system, collaboration between multiple people working on the same project quickly becomes chaotic. Git is both the underlying technology for tracking and merging changes in source code and the name of its command-line client (CLI). GitHub is an online platform built on top of Git that makes it easier to use and adds capabilities such as automation, pull requests, and user management.

Updated on: 11-Jan-2023
