What Is a Domain Generation Algorithm (DGA)? (How It Works, How to Detect It)

<p>Cyber-attackers use a Domain Generation Algorithm (DGA) to generate large numbers of new domain names at which malware can reach its command and control (C&C) servers. Because the names appear random and change constantly, it is hard in practice for security professionals to detect and contain the attack.</p>
<p>"Conficker A and B", a family of worms that initially generated 250 domain names every day, popularized the tactic. Starting with "Conficker C", the worm would generate 50,000 domain names each day and contact 500 of them, giving an infected workstation a 1% chance (500 out of 50,000) of being updated on any given day if the malware controllers registered only one domain per day.</p>
<p>To prevent compromised machines from updating, law enforcement would have had to pre-register 50,000 new domain names every day. The botnet owner, by contrast, needs to register only one or a few of the many domains that each bot queries every day.</p>
<p>DGAs are therefore simple to build, tough to block, and may be impossible to anticipate ahead of time. They can also be replaced swiftly if the algorithm in use is discovered.</p>
<p>A DGA usually consists of three parts −</p>
<ul class="list"><li><p>A time-sensitive "seed"</p></li><li><p>A domain "body" generator that takes this seed as input</p></li><li><p>A collection of top-level domains (TLDs)</p></li></ul>
<p>The seed is frequently just the current date in a standard format. The domain body generator is the most important component of a DGA, and its output may be anything from a random-looking string of letters to a concatenation of randomly chosen words to a constant section followed by a variable suffix. The TLDs, however, must be real ones, because the generated domains have to be registrable under them.</p>
<h2>How Does a Domain Generation Algorithm Work?</h2>
<p>DGAs create domains over time that act as meeting points for infected hosts and the C&C server to keep the scheme running. 
The DGA uses one of several strategies to create new names for its C&C server at regular intervals.</p>
<p>It may produce what appears to be a random sequence of numbers or characters and append a top-level domain suffix (e.g., .com or .org). In reality the sequence is pseudorandom: like the output of a pseudorandom number generator, it only looks random and is fully determined by a seed value, so the malware and its operator can compute the same names independently.</p>
<p>Each cycle usually yields hundreds or thousands of domain names. To set up a new C&C DNS record, attackers need to register just one of those domains (which is normally done automatically). The domains are generated on a predictable schedule that the malware or botnet knows. Bad actors can also set the DGA to generate new domains at whatever interval is convenient for them: every day, hour, or even minute.</p>
<h2>How to Detect DGAs?</h2>
<p>Blacklists can be used to block DGA domain names; however, their coverage is either insufficient (public blacklists) or wildly inconsistent (private, commercial vendor blacklists).</p>
<p>There are two types of detection techniques − <em>reactionary</em> and <em>real-time</em>. To assess whether a domain name is legitimate, the former uses external data such as DNS replies, IP address location, WHOIS, and TLS certificate information. 
The real-time approach, by contrast, examines only the domain name itself, treating it as a sequence of characters.</p>
<p>The following are the most common techniques used −</p>
<ul class="list"><li><p>N-gram splitting followed by frequency analysis of the domain name</p></li><li><p>Calculating the domain's entropy (this works poorly with non-ASCII domains and dictionary-based DGAs)</p></li><li><p>Analyzing the domain with recurrent neural networks (RNN)</p></li></ul>
<p>The N-gram technique performs well in terms of speed, but its accuracy is only moderate, and it is the harder of the two statistical methods to implement.</p>
<p>The entropy strategy, on the other hand, is the fastest and uses the least memory and CPU. It has the lowest accuracy of the three, but because it is simple and quick, it may be used if you only need a ballpark estimate and don't mind occasional false alarms.</p>
<p>Then there's the machine learning method, which is the most intriguing. Well-trained neural networks can deliver very high accuracy with a low rate of false alarms. However, that accuracy comes at the cost of speed and resource usage.</p>
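<p>To make the three-part structure and the entropy technique described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm of any real malware family: the function names, the SHA-256 based body generator, the three-entry TLD list, and the 12-character body length are all hypothetical choices made for this example.</p>

```python
import hashlib
import math
from collections import Counter
from datetime import date

# Hypothetical TLD list chosen for this sketch only.
TLDS = ["com", "org", "net"]


def generate_domains(day: date, count: int = 5, length: int = 12) -> list[str]:
    """Derive `count` random-looking domains from a date seed (toy example)."""
    domains = []
    for i in range(count):
        # Seed: the date in ISO format plus a counter. Anyone who knows the
        # algorithm and the date computes exactly the same list.
        seed = f"{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(seed).digest()
        # Body generator: map the first `length` digest bytes to letters a-z.
        body = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        # TLD: cycle through the fixed list.
        domains.append(f"{body}.{TLDS[i % len(TLDS)]}")
    return domains


def shannon_entropy(label: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def body_of(domain: str) -> str:
    """Return the part of the domain before the first dot."""
    return domain.split(".", 1)[0]


if __name__ == "__main__":
    # Print each generated domain next to the entropy of its body.
    for name in generate_domains(date(2024, 1, 1)):
        print(name, round(shannon_entropy(body_of(name)), 2))
```

<p>Random-looking bodies tend to score higher than ordinary words of similar length, which is what an entropy-based detector relies on. Any cut-off value would be an assumption that has to be tuned on real traffic, and, as noted above, the score says little about dictionary-based DGAs.</p>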