
Julia Programming - Working with Datasets
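This chapter shows how to work with datasets in Julia: reading and writing CSV, HDF5 and XML files, loading sample data with RDatasets, and computing basic statistics with StatsBase.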
CSV files
A CSV (Comma Separated Values) file is a plain text file in which each line is a record and commas separate the fields of that record. These files usually carry the .csv extension. Julia provides several ways to perform operations on CSV files.
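To make the format concrete, the following is a minimal sketch that parses a small piece of CSV text using only Julia's Base functions. The sample records are hypothetical, and the naive split on commas does not handle quoted fields that themselves contain commas, which is one reason to prefer the CSV package for real files.

```julia
# Hypothetical sample data written as raw CSV text: one record per line,
# fields separated by commas, with a header row naming the columns.
csv_text = """
Name,RollNo,Marks
Gaurav,1,54
Rahul,2,67
"""

lines = split(strip(csv_text), '\n')
header = split(lines[1], ',')
rows = [split(line, ',') for line in lines[2:end]]

# Convert the Marks field (third column) from text to integers.
marks = [parse(Int, row[3]) for row in rows]

println(header)
println(marks)
```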
Import a .CSV file in Julia
To import a .CSV file, we first need to install the CSV package. Use the following commands to do so −
using Pkg
Pkg.add("CSV")
Reading data
To read data from a CSV file in Julia, we use the read() method from the CSV package. Recent versions of CSV.jl require a sink type such as DataFrame as the second argument, so the DataFrames package must be loaded as well −
julia> using CSV, DataFrames
julia> CSV.read("C://Users//Leekha//Desktop//Iris.csv", DataFrame)
150×6 DataFrame
 Row  Id     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
      Int64  Float64        Float64       Float64        Float64       String
 1    1      5.1            3.5           1.4            0.2           Iris-setosa
 2    2      4.9            3.0           1.4            0.2           Iris-setosa
 3    3      4.7            3.2           1.3            0.2           Iris-setosa
 4    4      4.6            3.1           1.5            0.2           Iris-setosa
 5    5      5.0            3.6           1.4            0.2           Iris-setosa
 6    6      5.4            3.9           1.7            0.4           Iris-setosa
 7    7      4.6            3.4           1.4            0.3           Iris-setosa
 8    8      5.0            3.4           1.5            0.2           Iris-setosa
 9    9      4.4            2.9           1.4            0.2           Iris-setosa
 10   10     4.9            3.1           1.5            0.1           Iris-setosa
 ⋮
 140  140    6.9            3.1           5.4            2.1           Iris-virginica
 141  141    6.7            3.1           5.6            2.4           Iris-virginica
 142  142    6.9            3.1           5.1            2.3           Iris-virginica
 143  143    5.8            2.7           5.1            1.9           Iris-virginica
 144  144    6.8            3.2           5.9            2.3           Iris-virginica
 145  145    6.7            3.3           5.7            2.5           Iris-virginica
 146  146    6.7            3.0           5.2            2.3           Iris-virginica
 147  147    6.3            2.5           5.0            1.9           Iris-virginica
 148  148    6.5            3.0           5.2            2.0           Iris-virginica
 149  149    6.2            3.4           5.4            2.3           Iris-virginica
 150  150    5.9            3.0           5.1            1.8           Iris-virginica
Creating new CSV file
To create a new, empty CSV file, we can use the touch() function, which belongs to Julia's Base library rather than to the CSV package. We then build the content with the DataFrames package and save it with CSV.write(). Since CSV.write() creates the file if it does not exist, the touch() call is optional −
julia> using DataFrames
julia> using CSV
julia> touch("1234.csv")
"1234.csv"
julia> new_data = DataFrame(Name = ["Gaurav", "Rahul", "Aarav", "Raman", "Ravinder"], RollNo = [1, 2, 3, 4, 5], Marks = [54, 67, 90, 23, 95])
5×3 DataFrame
 Row  Name      RollNo  Marks
      String    Int64   Int64
 1    Gaurav    1       54
 2    Rahul     2       67
 3    Aarav     3       90
 4    Raman     4       23
 5    Ravinder  5       95
julia> CSV.write("1234.csv", new_data)
"1234.csv"
julia> CSV.read("1234.csv", DataFrame)
5×3 DataFrame
 Row  Name      RollNo  Marks
      String    Int64   Int64
 1    Gaurav    1       54
 2    Rahul     2       67
 3    Aarav     3       90
 4    Raman     4       23
 5    Ravinder  5       95
HDF5
HDF5 stands for Hierarchical Data Format version 5. Following are some of its properties −
- An HDF5 file is organized like a file system: a group is similar to a directory, and a dataset is like a file.
- Attributes are used to associate metadata with a particular group or dataset.
- Objects are identified by ASCII names.
- Language wrappers for HDF5 are often described as either low level or high level.
Opening HDF5 files
Once the HDF5.jl package is loaded with using HDF5, HDF5 files can be opened with the h5open command as follows −
fid = h5open(filename, mode)
The following table describes the available modes −
Sl.No | Mode | Meaning |
---|---|---|
1 | "r" | Read-only. |
2 | "r+" | Read-write. It will preserve any existing contents. |
3 | "cw" | Read-write. It will create the file if it does not exist. It will also preserve existing contents. |
4 | "w" | Read-write. It will destroy any existing contents. |
The above command returns a file object of type HDF5File, which is a subtype of the abstract type DataFile. Recent releases of HDF5.jl name this file type HDF5.File.
Closing HDF5 files
Once finished with a file, we should close it as follows −
close(fid)
It will also close all the objects in the file.
Opening HDF5 objects
Suppose we have a file object named fid that contains a group called "object1". The group can be opened by indexing the file with its name as a string −
obj1 = fid["object1"]
Closing HDF5 objects
close(obj1)
Reading data
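Suppose a group g contains a dataset with the path "dtset", and we have opened that dataset as dset1 = g["dtset"]. We can read its contents in the following ways −
ABC = read(dset1)
ABC = read(g, "dtset")
Asub = dset1[2:3, 1:3]
The first two forms read the whole dataset, while the third reads only the selected slice.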
Writing data
We can create and write a dataset in either of the following ways −
g["dset1"] = rand(3,5)
write(g, "dset1", rand(3,5))
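The open, write, read and close steps described above can be combined into one round trip. The following is a sketch under stated assumptions: it assumes a recent HDF5.jl release (where a group is created with create_group), and the file path, group name and array contents are hypothetical. It uses the do-block form of h5open, which closes the file automatically when the block ends.

```julia
using HDF5

path = joinpath(mktempdir(), "example.h5")   # hypothetical scratch file

# "w" creates the file; the do-block closes it even if an error occurs.
h5open(path, "w") do fid
    g = create_group(fid, "mygroup")          # hypothetical group name
    g["dset1"] = reshape(collect(1.0:15.0), 3, 5)
end

# Reopen read-only, then read the whole dataset and a 2×3 slice of it.
full, sub = h5open(path, "r") do fid
    dset1 = fid["mygroup"]["dset1"]
    (read(dset1), dset1[2:3, 1:3])
end

println(size(full))
println(sub)
```

Because both calls use the do-block form, no explicit close(fid) is needed in this sketch.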
XML files
Here we discuss the LightXML.jl package, a lightweight Julia wrapper for libxml2. It provides the following functionalities −
- Parsing an XML file
- Accessing the XML tree structure
- Creating an XML tree
- Exporting an XML tree to a string
Example
Suppose we have an XML file named new.xml as follows −
<Hello>
   <to>Gaurav</to>
   <from>Rahul</from>
   <heading>Reminder to meet</heading>
   <body>Friend, Don't forget to meet this weekend!</body>
</Hello>
Now, we can parse this file by using LightXML as follows −
julia> using LightXML
# The code below parses the XML file
julia> xdoc = parse_file("C://Users//Leekha//Desktop//new.xml")
<?xml version="1.0" encoding="utf-8"?>
<Hello>
   <to>Gaurav</to>
   <from>Rahul</from>
   <heading>Reminder to meet</heading>
   <body>Friend, Don't forget to meet this weekend!</body>
</Hello>
The following example shows how to get the root element and traverse its children. In the output, node type 3 is a text node (the whitespace between tags) and node type 1 is an element node −
julia> xroot = root(xdoc);
julia> println(name(xroot))
Hello
# Traverse all the child nodes and print the element names
julia> for c in child_nodes(xroot)   # c is an instance of XMLNode
          println(nodetype(c))
          if is_elementnode(c)
             e = XMLElement(c)   # this makes an XMLElement instance
             println(name(e))
          end
       end
3
1
to
3
1
from
3
1
heading
3
1
body
3
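The remaining two functionalities, creating an XML tree and exporting it to a string, can be sketched as follows. This is a minimal illustration assuming the LightXML functions XMLDocument, create_root, new_child, add_text, parse_string, find_element, content and free; it rebuilds part of the new.xml document in memory rather than reading it from disk.

```julia
using LightXML

# Build a document with the same root and first three children as new.xml.
xdoc = XMLDocument()
xroot = create_root(xdoc, "Hello")
for (tag, text) in [("to", "Gaurav"), ("from", "Rahul"),
                    ("heading", "Reminder to meet")]
    child = new_child(xroot, tag)
    add_text(child, text)
end

# Export the tree to a string, then parse that string back into a new tree.
xml_text = string(xdoc)
xdoc2 = parse_string(xml_text)
recipient = content(find_element(root(xdoc2), "to"))
println(recipient)

# LightXML wraps memory owned by libxml2, so documents are freed explicitly.
free(xdoc)
free(xdoc2)
```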
RDatasets
The RDatasets.jl package provides an easy way to use and experiment with most of the standard datasets that ship with the core of R. To load and work with one of the datasets included in RDatasets, we first need to install the package as follows −
julia> using Pkg
julia> Pkg.add("RDatasets")
Subsetting the data
For example, we will use the Gcsemv dataset from the mlmRev group. In recent versions of DataFrames, the first rows are shown with first() instead of the older head() −
julia> using RDatasets
julia> GetData = dataset("mlmRev","Gcsemv");
julia> summary(GetData);
julia> first(GetData, 6)
6×5 DataFrame
 Row  School       Student      Gender       Written  Course
      Categorical  Categorical  Categorical  Float64  Float64
 1    20920        16           M            23.0     missing
 2    20920        25           F            missing  71.2
 3    20920        27           F            39.0     76.8
 4    20920        31           F            36.0     87.9
 5    20920        42           M            16.0     44.4
 6    20920        62           F            36.0     missing
We can select the rows for a particular school by comparing the School column with the school code and using the result as a row index −
julia> GetData[GetData.School .== "68137", :]
104×5 DataFrame
 Row  School       Student      Gender       Written  Course
      Categorical  Categorical  Categorical  Float64  Float64
 1    68137        1            F            18.0     56.4
 2    68137        2            F            23.0     55.5
 3    68137        3            F            25.0     missing
 4    68137        4            F            29.0     73.1
 5    68137        5            F            missing  66.6
 6    68137        9            F            20.0     60.1
 7    68137        11           F            34.0     63.8
 8    68137        12           F            60.0     89.8
 9    68137        13           F            44.0     76.8
 10   68137        14           F            20.0     58.3
 ⋮
 94   68137        252          M            missing  75.9
 95   68137        254          M            35.0     missing
 96   68137        255          M            36.0     62.0
 97   68137        258          M            23.0     61.1
 98   68137        260          M            25.0     missing
 99   68137        261          M            46.0     89.8
 100  68137        264          M            50.0     70.3
 101  68137        268          M            15.0     43.5
 102  68137        270          M            missing  73.1
 103  68137        272          M            43.0     78.7
 104  68137        273          M            35.0     60.1
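The row selection above relies on a Boolean mask. The following sketch shows the same idea on plain vectors, using only Base Julia. The two toy columns are hypothetical stand-ins for the School and Written columns, not values taken from the dataset.

```julia
# Hypothetical toy columns standing in for the School and Written columns.
school  = ["20920", "68137", "68137", "20920", "68137"]
written = [23.0, 18.0, 23.0, 39.0, 25.0]

# Broadcasting == with the dot compares every element and yields a Bool mask.
mask = school .== "68137"

# Indexing with the mask keeps only the positions where it is true.
selected = written[mask]

println(mask)
println(selected)
```

A data frame applies the same mask to every column at once, which is why the row index is followed by a colon for "all columns".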
Sorting the data
With the help of the sort!() function, we can sort the data in place. For example, here we sort the dataset by written examination score in ascending order. Rows whose Written value is missing are placed at the end −
julia> sort!(GetData, :Written)
1905×5 DataFrame
 Row   School       Student      Gender       Written  Course
       Categorical  Categorical  Categorical  Float64  Float64
 1     22710        77           F            0.6      41.6
 2     68137        65           F            2.5      50.0
 3     22520        115          M            3.1      9.25
 4     68137        80           F            4.3      50.9
 5     68137        79           F            7.5      27.7
 6     22710        57           F            11.0     73.1
 7     64327        19           F            11.0     87.0
 8     68137        85           F            11.0     27.7
 9     68137        97           F            11.0     57.4
 10    68137        100          F            11.0     61.1
 ⋮
 1895  74874        83           F            missing  81.4
 1896  74874        86           F            missing  92.5
 1897  76631        79           F            missing  84.2
 1898  76631        193          M            missing  72.2
 1899  76631        221          F            missing  76.8
 1900  77207        5001         F            missing  82.4
 1901  77207        5062         M            missing  75.0
 1902  77207        5063         F            missing  79.6
 1903  84772        17           M            missing  88.8
 1904  84772        49           M            missing  74.0
 1905  84772        85           F            missing  90.7
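Sorting rows by one column amounts to computing a permutation from that column and applying it to all the others. The following Base-only sketch illustrates this with two hypothetical toy columns; it also shows that Julia orders missing after every number, which matches the position of the missing rows in the output above.

```julia
# Hypothetical toy columns; `missing` marks an absent score.
student = [16, 25, 27, 31]
written = [23.0, missing, 39.0, 11.0]

# sortperm returns the row order that would sort `written` in ascending order.
order = sortperm(written)

# Applying the same order to each column keeps the rows aligned.
sorted_student = student[order]
sorted_written = written[order]

println(sorted_student)
println(sorted_written)
```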
Statistics in Julia
For simple statistics, Julia has the StatsBase.jl package. We need to install it as follows −
julia> using Pkg
julia> Pkg.add("StatsBase")
Simple Statistics
StatsBase provides methods to define weights and calculate means.
We can use the Weights constructor (or the equivalent weights() function) to define a weight vector as follows −
julia> using StatsBase
julia> WV = Weights([10.,11.,12.])
3-element Weights{Float64,Float64,Array{Float64,1}}:
 10.0
 11.0
 12.0
We can use the isempty() function to check whether the weight vector is empty or not −
julia> isempty(WV)
false
We can check the element type of the weight vector with the help of the eltype() function as follows −
julia> eltype(WV)
Float64
We can check the length of the weight vector with the help of the length() function as follows −
julia> length(WV)
3
There are different ways to calculate the mean −
Harmonic mean − We can use the harmmean() function to calculate the harmonic mean.
julia> A = [3, 5, 6, 7, 8, 2, 9, 10]
8-element Array{Int64,1}:
  3
  5
  6
  7
  8
  2
  9
 10
julia> harmmean(A)
4.764831009217679
Geometric mean − We can use the geomean() function to calculate the geometric mean.
julia> geomean(A)
5.555368605381863
Arithmetic mean − We can use the mean() function to calculate the ordinary (arithmetic) mean.
julia> mean(A)
6.25
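The three means can also be written out from their textbook definitions. The following Base-only sketch is meant as a cross-check of what the functions above compute, not as a description of how StatsBase implements them.

```julia
A = [3, 5, 6, 7, 8, 2, 9, 10]
n = length(A)

arithmetic = sum(A) / n              # sum of the values divided by their count
geometric  = exp(sum(log.(A)) / n)   # n-th root of the product, computed via logs
harmonic   = n / sum(1 ./ A)         # count divided by the sum of reciprocals

println((harmonic, geometric, arithmetic))
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean, as the three values above illustrate.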
Descriptive Statistics
Descriptive statistics is the branch of statistics that summarizes data with a few numbers, such as the variance and the standard deviation, which describe how spread out the values are.
Calculating variance
We can use the var() function to calculate the variance of a vector as follows −
julia> B = [1., 2., 3., 4., 5.];
julia> var(B)
2.5
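By default var() computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n. The following Base-only sketch spells out that formula for the same vector.

```julia
B = [1., 2., 3., 4., 5.]
n = length(B)

m = sum(B) / n                              # arithmetic mean of the data
sample_var = sum((B .- m) .^ 2) / (n - 1)   # squared deviations over n - 1
sample_std = sqrt(sample_var)               # standard deviation is its square root

println(sample_var)
println(sample_std)
```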
Calculating weighted variance
We can calculate the weighted variance of a vector with respect to a weight vector as follows −
julia> B = [1., 2., 3., 4., 5.];
julia> a = aweights([4., 2., 1., 3., 1.])
5-element AnalyticWeights{Float64,Float64,Array{Float64,1}}:
 4.0
 2.0
 1.0
 3.0
 1.0
julia> var(B, a)
2.066115702479339
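The value printed above is consistent with the uncorrected weighted variance, in which the weighted sum of squared deviations from the weighted mean is divided by the sum of the weights. The following Base-only sketch reproduces that arithmetic; the plain vector w holds the same numbers as the weight vector a.

```julia
B = [1., 2., 3., 4., 5.]
w = [4., 2., 1., 3., 1.]   # same numbers as the weight vector a

wmean = sum(w .* B) / sum(w)                   # weighted mean
wvar  = sum(w .* (B .- wmean) .^ 2) / sum(w)   # uncorrected weighted variance
wstd  = sqrt(wvar)                             # weighted standard deviation

println((wmean, wvar, wstd))
```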
Calculating standard deviation
We can use the std() function to calculate the standard deviation of a vector as follows −
julia> std(B)
1.5811388300841898
Calculating weighted standard deviation
We can calculate the weighted standard deviation of a vector with respect to a weight vector as follows −
julia> std(B,a)
1.4373989364401725
Calculating mean and standard deviation
We can calculate the mean and standard deviation in a single command as follows −
julia> mean_and_std(B,a)
(2.5454545454545454, 1.4373989364401725)
Calculating mean and variance
We can calculate the mean and variance in a single command as follows −
julia> mean_and_var(B,a)
(2.5454545454545454, 2.066115702479339)
Samples and Estimations
Sampling is the part of statistics in which a subset of units is selected from a larger population for analysis.
Following are the ways in which we can do sampling −
Taking a random sample is the simplest way of sampling. Here we draw one random element from the array, i.e., the population set. The function for this purpose is sample().
Example
julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
  8.0
 12.0
 23.0
 54.5
julia> sample(A)
12.0
Next, we can take n elements as a random sample.
Example
julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
  8.0
 12.0
 23.0
 54.5
julia> sample(A, 2)
2-element Array{Float64,1}:
 23.0
 54.5
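When several elements are drawn, it matters whether the same element may be picked twice. StatsBase's sample(A, n) draws with replacement by default and accepts replace=false to forbid repeats. The following sketch illustrates the two behaviours with only Base Julia and the Random standard library; the seed value is an arbitrary choice made here so that runs are repeatable.

```julia
using Random

rng = MersenneTwister(42)   # arbitrary fixed seed so runs are repeatable
A = [8., 12., 23., 54.5]

one_draw     = rand(rng, A)           # a single element, picked uniformly
with_repl    = rand(rng, A, 2)        # two draws; the same element may repeat
without_repl = shuffle(rng, A)[1:2]   # two distinct positions, no repeats

println(one_draw)
println(with_repl)
println(without_repl)
```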
We can also write the sampled elements into a pre-allocated array of length n. The function to do this task is sample!(). In the call sample!(B, X), elements drawn from B overwrite the contents of X.
Example
julia> B = [1., 2., 3., 4., 5.];
julia> X = [2., 1., 3., 2., 5.];
julia> sample!(B,X)
5-element Array{Float64,1}:
 2.0
 2.0
 4.0
 1.0
 3.0
Another way is direct sampling, which randomly picks elements from the population set, with replacement, and stores them in another array. The function to do this task is direct_sample!().
Example
julia> StatsBase.direct_sample!(B, X)
5-element Array{Float64,1}:
 1.0
 4.0
 4.0
 4.0
 5.0
Knuth's algorithm is another way of doing random sampling, this time without replacement, so no element is picked twice. The function to do this task is knuths_sample!().
Example
julia> StatsBase.knuths_sample!(B, X)
5-element Array{Float64,1}:
 5.0
 3.0
 4.0
 2.0
 1.0
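To show how sampling without replacement can work in a single pass, the following is a sketch of selection sampling (Knuth's Algorithm S) in plain Julia. The function name selection_sample is hypothetical and this is not StatsBase's implementation. Each element is accepted with probability (number still needed) / (number still remaining), which guarantees exactly n distinct picks; this sketch returns them in population order.

```julia
using Random

# Hypothetical helper: pick n elements of `population` without replacement.
function selection_sample(rng::AbstractRNG, population::AbstractVector, n::Integer)
    N = length(population)
    0 <= n <= N || throw(ArgumentError("n must be between 0 and $N"))
    out = Vector{eltype(population)}(undef, n)
    m = 0                                   # number of elements picked so far
    for (t, item) in enumerate(population)
        remaining = N - t + 1               # elements not yet examined, incl. this one
        # Accept with probability (n - m) / remaining.
        if rand(rng) * remaining < (n - m)
            m += 1
            out[m] = item
            m == n && break
        end
    end
    return out
end

picked = selection_sample(MersenneTwister(7), [1., 2., 3., 4., 5.], 3)
println(picked)
```

When the number still needed equals the number remaining, the acceptance probability becomes 1, so the loop can never finish with fewer than n picks.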