Sqoop Interview Questions
Dear readers, these Sqoop Interview Questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of Sqoop. As per my experience good interviewers hardly plan to ask any particular question during your interview, normally questions start with some basic concept of the subject and later they continue based on further discussion and what you answer:
To connect to different relational databases sqoop needs a connector. Almost every DB vendor makes this connecter available as a JDBC driver which is specific to that DB. So Sqoop needs the JDBC driver of each of the database it needs to inetract with.
No. Sqoop needs both JDBC and connector to connect to a database.
To specify a particular directory in HDFS use --target-dir but to specify the parent directory of all the sqoop jobs use --warehouse-dir. In this case under the parent directory sqoop will cerate a directory with the same name as th e table.
By using the WHERE clause in the sqoop import statement we can import only a subset of rows.
We can run a filtering query on the database and save the result to a temporary table in database.
Then use the sqoop import command without using the --where clause
The --password-file option can be used inside a sqoop script while the -P option reads from standard input , preventing automation.
To get the out file of a sqoop import in formats other than .gz like .bz2 we use the --compress -code parameter.
The native utilities used by databases to support faster laod do not work for binary data formats like SequenceFile.
The Parameter --num-mapers is used to control the number of mappers executed by a sqoop command. We should start with choosing a small number of map tasks and then gradually scale up as choosing high number of mappers initially may slow down the performance on the database side.
Using the command
--exclude-tables table1,table2 ..
This will import all the tables except the ones mentioned in the exclude-tables clause.
sqoop can have 2 approaches.
a − To use the --incremental parameter with append option where value of some columns are checked and only in case of modified values the row is imported as a new row.
b − To use the --incremental parameter with lastmodified option where a date column in the source is checked for records which have been updated after the last import.
The options file is used in sqoop to specify the command line values in a file and use it in the sqoop commands.
For example the --connect parameter's value and --user name value scan be stored in a file and used again and again with different sqoop commands.
Yes, we can add an argument to a saved job at runtime by using the --exec option
sqoop job --exec jobname -- -- newparameter
By using the --query parameter in place of --table parameter we can specify a sql query. The result of the query will be imported.
Using the --split-by parameter we specify the column name based on which sqoop will divide the data to be imported into multiple chunks to be run in parallel.
By using the --mapreduce-job-name parameter. Below is a example of the command.
sqoop import \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop \ --password sqoop \ --query 'SELECT normcities.id, \ countries.country, \ normcities.city \ FROM normcities \ JOIN countries USING(country_id) \ WHERE $CONDITIONS' \ --split-by id \ --target-dir cities \ --mapreduce-job-name normcities
We can use the --boundary –query parameter in which we specify the min and max value for the column based on which the split can happen into multiple mapreduce tasks. This makes it faster as the query inside the –boundary-query parameter is executed first and the job is ready with the information on how many mapreduce tasks to create before executing the main query.
The parameter “sqoop.export.records.per.statement” specifies the number of records that will be used in each insert statement.
But the parameter “sqoop.export.statements.per.transaction” specifies how many insert statements can be processed parallel during a transaction.
Using the staging-table option we first load the data into a staging table and then load it to the final target table only if the staging load is successful.
By specifying the –clear-staging-table option we can clear the staging table before it is loaded. This can be done again and again till we get proper data in staging.
The parameter --update-key can be used to update existing rows. In it a comma-separated list of columns is used which uniquely identifies a row. All of these columns is used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query.
Truncate the target table and load it again.
By using the –column parameter in which we mention the required column names as a comma separated list of values.
By using the –input-null-string parameter we can specify a default value and that will allow the row to be inserted into the target table.
Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed.
Some of the imported records might have null values in all the columns. As Hbase does not allow all null values in a row, those rows get dropped.
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
Sqoop can import data form a relational database using any SQL query rather than only using table and column name parameters.
By using the –m 1 clause in the import command, sqoop cerates only one mapreduce task which will import the rows sequentially.
The Mapreduce cluster is configured to run 4 parallel tasks. So the sqoop command must have number of parallel tasks less or equal to that of the MapReduce cluster.
The –split-by clause mentions the column name based on whose value the data will be divided into groups of records. These group of records will be read in parallel by the mapreduce tasks.
It imports data from a database to a HDFS file named foo located in the directory /dest
Using the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.
By using the --map-column-java property we can configure the mapping between.
Below is an example
$ sqoop import ... --map-column-java id = String, value = Integer
By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
Delimited text and Sequence Files.
$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES \ --columns "employee_id,first_name,last_name"
$ sqoop import --connect jdbc:mysql://host/dbname --table table_name\ -m 8
It imports the employees who have joined after 9-NOv-2012.
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --split-by dept_id
It performs an incremental import of new data, after having already imported the first 100,0rows of a table
sqoop import-all-tables --connect jdbc:mysql://host/DB1
$ sqoop export --connect jdbc:mysql://host/DB1 --call proc1 \ --export-dir /Dir1
It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets.
sqoop job –list
Sqoop job –show myjob
Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.
The metastore database can be hosted anywhere within or outside of the Hadoop cluster..
What is Next ?
Further you can go through your past assignments you have done with the subject and make sure you are able to speak confidently on them. If you are fresher then interviewer does not expect you will answer very complex questions, rather you have to make your basics concepts very strong.
Second it really doesn't matter much if you could not answer few questions but it matters that whatever you answered, you must have answered with confidence. So just feel confident during your interview. We at tutorialspoint wish you best luck to have a good interviewer and all the very best for your future endeavor. Cheers :-)