Julia Programming - Working with Datasets



In this chapter, we discuss how to work with datasets in Julia.

CSV files

A CSV (Comma-Separated Values) file is a plain text file that uses commas to separate the fields of each record. These files usually have the .csv extension. Julia, through the CSV package, provides various methods to perform operations on CSV files.

Import a .CSV file in Julia

To import a .CSV file, we first need to install the CSV package. Use the following commands to do so −

using Pkg
Pkg.add("CSV")

Reading data

To read data from a CSV file in Julia, we use the read() method from the CSV package. Recent versions of CSV.jl require a second argument naming the sink type; here we pass DataFrame from the DataFrames package (installed with Pkg.add("DataFrames")). The exact column types printed depend on the package version; recent versions may show an inline string type instead of String −

julia> using CSV, DataFrames

julia> CSV.read("C://Users//Leekha//Desktop//Iris.csv", DataFrame)
150×6 DataFrame
 Row   Id     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
       Int64  Float64        Float64       Float64        Float64       String
 --------------------------------------------------------------------------------------
 1     1      5.1            3.5           1.4            0.2           Iris-setosa
 2     2      4.9            3.0           1.4            0.2           Iris-setosa
 3     3      4.7            3.2           1.3            0.2           Iris-setosa
 4     4      4.6            3.1           1.5            0.2           Iris-setosa
 5     5      5.0            3.6           1.4            0.2           Iris-setosa
 6     6      5.4            3.9           1.7            0.4           Iris-setosa
 7     7      4.6            3.4           1.4            0.3           Iris-setosa
 8     8      5.0            3.4           1.5            0.2           Iris-setosa
 9     9      4.4            2.9           1.4            0.2           Iris-setosa
 10    10     4.9            3.1           1.5            0.1           Iris-setosa
 ⋮
 140   140    6.9            3.1           5.4            2.1           Iris-virginica
 141   141    6.7            3.1           5.6            2.4           Iris-virginica
 142   142    6.9            3.1           5.1            2.3           Iris-virginica
 143   143    5.8            2.7           5.1            1.9           Iris-virginica
 144   144    6.8            3.2           5.9            2.3           Iris-virginica
 145   145    6.7            3.3           5.7            2.5           Iris-virginica
 146   146    6.7            3.0           5.2            2.3           Iris-virginica
 147   147    6.3            2.5           5.0            1.9           Iris-virginica
 148   148    6.5            3.0           5.2            2.0           Iris-virginica
 149   149    6.2            3.4           5.4            2.3           Iris-virginica
 150   150    5.9            3.0           5.1            1.8           Iris-virginica

Creating new CSV file

To create a new CSV file, we can use the touch() function, which belongs to Julia's Base library rather than to the CSV package. We then build the content with the DataFrames package and write it to the new file with CSV.write(), which takes the file name directly, so no separate open() call is needed. When reading the file back, recent versions of CSV.jl require the sink type (here DataFrame) as a second argument and may report an inline string type instead of String for the Name column −

julia> using DataFrames

julia> using CSV

julia> touch("1234.csv")
"1234.csv"

julia> new_data = DataFrame(Name = ["Gaurav", "Rahul", "Aarav", "Raman", "Ravinder"],
                            RollNo = [1, 2, 3, 4, 5],
                            Marks = [54, 67, 90, 23, 95])
5×3 DataFrame
 Row   Name      RollNo  Marks
       String    Int64   Int64
 1     Gaurav    1       54
 2     Rahul     2       67
 3     Aarav     3       90
 4     Raman     4       23
 5     Ravinder  5       95

julia> CSV.write("1234.csv", new_data)
"1234.csv"

julia> CSV.read("1234.csv", DataFrame)
5×3 DataFrame
 Row   Name      RollNo  Marks
       String    Int64   Int64
 1     Gaurav    1       54
 2     Rahul     2       67
 3     Aarav     3       90
 4     Raman     4       23
 5     Ravinder  5       95
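As a minimal sketch of the same write-then-read round trip, the block below works in a temporary directory so that it leaves no files behind. The file name students.csv and the variable names are placeholders chosen for illustration, and the sketch assumes a recent CSV.jl in which CSV.read() takes the sink type as its second argument.

```julia
using CSV, DataFrames

students = DataFrame(Name = ["Gaurav", "Rahul", "Aarav"],
                     RollNo = [1, 2, 3],
                     Marks = [54, 67, 90])

# mktempdir with a do-block removes the directory once the block finishes.
roundtrip = mktempdir() do dir
    path = joinpath(dir, "students.csv")   # hypothetical file name
    CSV.write(path, students)              # creates the file if it is missing
    CSV.read(path, DataFrame)              # second argument is the sink type
end

println(roundtrip == students)
```

Comparing the two data frames with == is a quick way to check that nothing was lost in the round trip.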

HDF5

HDF5 stands for Hierarchical Data Format version 5. Following are some of its properties −

  • A group is similar to a directory, and a dataset is like a file.

  • It uses attributes to associate metadata with a particular group or dataset.

  • It uses ASCII names for the different objects.

  • Language wrappers for HDF5 are often described as either low level or high level.

Opening HDF5 files

HDF5 files can be opened with the h5open function from the HDF5.jl package (loaded with using HDF5) as follows −

fid = h5open(filename, mode)

The following table describes the modes −

Sl.No  Mode   Meaning
1      "r"    read-only
2      "r+"   read-write − It will preserve any existing contents.
3      "cw"   read-write − It will create the file if it does not exist. It will also preserve existing contents.
4      "w"    read-write − It will destroy any existing contents.

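The above command returns a file object. In the version of HDF5.jl described here, its type is HDF5File, a subtype of the abstract type DataFile; recent releases of the package name this type HDF5.File.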

Closing HDF5 files

Once finished with a file, we should close it as follows −

close(fid)

It will also close all the objects in the file.

Opening HDF5 objects

Suppose we have a file object named fid that contains a group called object1. The group can be opened by indexing the file object with its name as a string −

obj1 = fid["object1"]

Closing HDF5 objects

close(obj1)

Reading data

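Suppose a group g contains a dataset with the path "dtset", and we have opened that dataset as dset1 = g["dtset"]. We can read the information in the following ways −

ABC = read(dset1)         # read the whole dataset through its handle
ABC = read(g, "dtset")    # read the whole dataset by its path within g
Asub = dset1[2:3, 1:3]    # read only a subset of the dataset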

Writing data

We can create a dataset and write data to it in either of the following ways. Both forms create dset1, so use one or the other −

g["dset1"] = rand(3,5)
write(g, "dset1", rand(3,5))
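The opening, writing, reading, and closing steps described above can be combined as in the following sketch. It assumes a recent release of HDF5.jl (where groups are created with create_group), and the file name example.h5 and group name mygroup are placeholders invented for this illustration.

```julia
using HDF5

# Hypothetical file and group names, created in a temporary directory.
path = joinpath(mktempdir(), "example.h5")
data = reshape(collect(1.0:15.0), 3, 5)

# "w" creates the file, destroying any existing contents.
h5open(path, "w") do fid
    g = create_group(fid, "mygroup")
    g["dset1"] = data            # create the dataset and write to it
end                              # the file is closed when the block ends

# "r" reopens the file read-only.
whole, part = h5open(path, "r") do fid
    dset1 = fid["mygroup"]["dset1"]
    (read(dset1), dset1[2:3, 1:3])   # full array, then a 2×3 slice
end

println(size(whole))
println(size(part))
```

Using h5open with a do-block is one way to make sure the file is closed even if an error occurs inside the block.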

XML files

Here we discuss the LightXML.jl package, a lightweight Julia wrapper for libxml2. It provides the following functionalities −

  • Parsing an XML file

  • Accessing XML tree structure

  • Creating an XML tree

  • Exporting an XML tree to a string

Example

Suppose we have an XML file named new.xml as follows −

<Hello>
   <to>Gaurav</to>
   <from>Rahul</from>
   <heading>Reminder to meet</heading>
   <body>Friend, Don't forget to meet this weekend!</body>
</Hello>

Now, we can parse this file by using LightXML as follows −

julia> using LightXML

# the code below parses this XML file
julia> xdoc = parse_file("C://Users//Leekha//Desktop//new.xml")
<?xml version="1.0" encoding="utf-8"?>
<Hello>
   <to>Gaurav</to>
   <from>Rahul</from>
   <heading>Reminder to meet</heading>
   <body>Friend, Don't forget to meet this weekend!</body>
</Hello>

The following example shows how to get the root element and traverse its child nodes. In the output, node type 3 denotes a text node (here, the whitespace between elements) and node type 1 denotes an element node −

julia> xroot = root(xdoc);

julia> println(name(xroot))
Hello

# Traverse all the child nodes and print the element names
julia> for c in child_nodes(xroot)   # c is an instance of XMLNode
          println(nodetype(c))
          if is_elementnode(c)
             e = XMLElement(c)   # this makes an XMLElement instance
             println(name(e))
          end
       end
3
1
to
3
1
from
3
1
heading
3
1
body
3
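When only the element children are of interest, the whitespace text nodes can be skipped. The sketch below parses the same document from a string (so no file path is needed) and iterates with child_elements instead of child_nodes; the variable names are placeholders for illustration.

```julia
using LightXML

# The same document as new.xml, held in a string so no file is needed.
xml_text = """
<Hello>
  <to>Gaurav</to>
  <from>Rahul</from>
  <heading>Reminder to meet</heading>
  <body>Friend, Don't forget to meet this weekend!</body>
</Hello>
"""

xdoc = parse_string(xml_text)
xroot = root(xdoc)

# child_elements yields only element nodes, skipping text nodes.
element_names = String[]
element_text = String[]
for e in child_elements(xroot)
    push!(element_names, name(e))      # tag name, e.g. "to"
    push!(element_text, content(e))    # text inside the element
end

free(xdoc)   # release the memory held by libxml2

println(element_names)
```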

RDatasets

The RDatasets.jl package provides an easy way to use and experiment with most of the standard datasets that are available in the core of R. To load and work with one of the datasets included in RDatasets, we first need to install the package as follows −

julia> using Pkg

julia> Pkg.add("RDatasets")

Subsetting the data

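For example, we will use the Gcsemv dataset in the mlmRev group. After loading the package with using RDatasets, we fetch the dataset and look at its first six rows. Recent versions of DataFrames use first(GetData, 6) in place of the older head(GetData) −

julia> using RDatasets

julia> GetData = dataset("mlmRev","Gcsemv");

julia> summary(GetData);

julia> first(GetData, 6)
6×5 DataFrame
 Row   School       Student      Gender       Written  Course
       Categorical  Categorical  Categorical  Float64  Float64
 1     20920        16           M            23.0     missing
 2     20920        25           F            missing  71.2
 3     20920        27           F            39.0     76.8
 4     20920        31           F            36.0     87.9
 5     20920        42           M            16.0     44.4
 6     20920        62           F            36.0     missing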

We can select the data for a particular school with a Boolean mask on the School column. Recent versions of DataFrames access a column as GetData.School rather than GetData[:School] −

julia> GetData[GetData.School .== "68137", :]
104×5 DataFrame
 Row   School       Student      Gender       Written  Course
       Categorical  Categorical  Categorical  Float64  Float64
 1     68137        1            F            18.0     56.4
 2     68137        2            F            23.0     55.5
 3     68137        3            F            25.0     missing
 4     68137        4            F            29.0     73.1
 5     68137        5            F            missing  66.6
 6     68137        9            F            20.0     60.1
 7     68137        11           F            34.0     63.8
 8     68137        12           F            60.0     89.8
 9     68137        13           F            44.0     76.8
 10    68137        14           F            20.0     58.3
 ⋮
 94    68137        252          M            missing  75.9
 95    68137        254          M            35.0     missing
 96    68137        255          M            36.0     62.0
 97    68137        258          M            23.0     61.1
 98    68137        260          M            25.0     missing
 99    68137        261          M            46.0     89.8
 100   68137        264          M            50.0     70.3
 101   68137        268          M            15.0     43.5
 102   68137        270          M            missing  73.1
 103   68137        272          M            43.0     78.7
 104   68137        273          M            35.0     60.1

Sorting the data

With the help of the sort!() function, we can sort the data in place. For example, here we sort the dataset in ascending order of the written examination score; rows with a missing score are placed last. Recent versions of DataFrames take the column name as a positional argument instead of the older cols keyword −

julia> sort!(GetData, :Written)
1905×5 DataFrame
 Row   School       Student      Gender       Written  Course
       Categorical  Categorical  Categorical  Float64  Float64
 1     22710        77           F            0.6      41.6
 2     68137        65           F            2.5      50.0
 3     22520        115          M            3.1      9.25
 4     68137        80           F            4.3      50.9
 5     68137        79           F            7.5      27.7
 6     22710        57           F            11.0     73.1
 7     64327        19           F            11.0     87.0
 8     68137        85           F            11.0     27.7
 9     68137        97           F            11.0     57.4
 10    68137        100          F            11.0     61.1
 ⋮
 1895  74874        83           F            missing  81.4
 1896  74874        86           F            missing  92.5
 1897  76631        79           F            missing  84.2
 1898  76631        193          M            missing  72.2
 1899  76631        221          F            missing  76.8
 1900  77207        5001         F            missing  82.4
 1901  77207        5062         M            missing  75.0
 1902  77207        5063         F            missing  79.6
 1903  84772        17           M            missing  88.8
 1904  84772        49           M            missing  74.0
 1905  84772        85           F            missing  90.7
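The subsetting and sorting operations above can be tried without downloading any dataset. The following sketch uses a small, made-up data frame (the rows and the school labels "A" and "B" are invented for illustration and are not taken from Gcsemv) and assumes a recent version of DataFrames.

```julia
using DataFrames

# Made-up rows for illustration only; not taken from the Gcsemv dataset.
scores = DataFrame(School = ["A", "B", "A", "B"],
                   Written = [23.0, missing, 39.0, 16.0],
                   Course = [missing, 71.2, 76.8, 44.4])

# Boolean-mask row selection, as in the school example above.
school_a = scores[scores.School .== "A", :]

# sort returns a sorted copy (sort! would modify scores in place);
# rows with a missing Written score are placed last.
by_written = sort(scores, :Written)

println(school_a)
println(by_written)
```

Using sort rather than sort! keeps the original data frame unchanged, which can be convenient when the original row order is still needed.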

Statistics in Julia

Julia's StatsBase.jl package provides an easy way to compute simple statistics. To use it, we first install StatsBase and then load it, together with the Statistics standard library that provides mean(), var(), and std() −

julia> using Pkg

julia> Pkg.add("StatsBase")

julia> using StatsBase, Statistics

Simple Statistics

StatsBase provides methods to define weights and calculate means.

We can use the Weights() constructor (or the equivalent weights() function) to define weight vectors as follows −

julia> WV = Weights([10.,11.,12.])
3-element Weights{Float64,Float64,Array{Float64,1}}:
 10.0
 11.0
 12.0

We can use the isempty() function to check whether the weight vector is empty or not −

julia> isempty(WV)
false

We can check the element type of the weight vector with the help of the eltype() function as follows −

julia> eltype(WV)
Float64

We can check the length of the weight vector with the help of the length() function as follows −

julia> length(WV)
3

There are different ways to calculate the mean −

  • Harmonic mean − We can use the harmmean() function to calculate the harmonic mean.

julia> A = [3, 5, 6, 7, 8, 2, 9, 10]
8-element Array{Int64,1}:
  3
  5
  6
  7
  8
  2
  9
 10

julia> harmmean(A)
4.764831009217679

  • Geometric mean − We can use the geomean() function to calculate the geometric mean.

julia> geomean(A)
5.555368605381863

  • Arithmetic mean − We can use the mean() function to calculate the ordinary (arithmetic) mean.

julia> mean(A)
6.25
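To see what these three functions compute, the sketch below evaluates each mean directly from its textbook definition using only the Statistics standard library. The variable names are placeholders, and the formulas are the standard definitions rather than StatsBase's internal implementation.

```julia
using Statistics

A = [3, 5, 6, 7, 8, 2, 9, 10]

# Harmonic mean: number of elements divided by the sum of reciprocals.
harmonic = length(A) / sum(inv, A)

# Geometric mean: exponential of the mean of the logarithms.
geometric = exp(mean(log, A))

# Arithmetic mean: sum of the elements divided by their number.
arithmetic = sum(A) / length(A)

println((harmonic, geometric, arithmetic))
```

For positive data the three values always satisfy harmonic ≤ geometric ≤ arithmetic, which matches the results shown above.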

Descriptive Statistics

Descriptive statistics is the discipline of statistics in which information is extracted from data and analyzed. This information summarizes the essence of the data, for example its spread around the mean.

Calculating variance

We can use the var() function to calculate the variance of a vector as follows −

julia> B = [1., 2., 3., 4., 5.];

julia> var(B)
2.5

Calculating weighted variance

We can calculate the weighted variance of a vector with respect to a weight vector as follows −

julia> B = [1., 2., 3., 4., 5.];

julia> a = aweights([4., 2., 1., 3., 1.])
5-element AnalyticWeights{Float64,Float64,Array{Float64,1}}:
 4.0
 2.0
 1.0
 3.0
 1.0

julia> var(B, a)
2.066115702479339

Calculating standard deviation

We can use the std() function to calculate the standard deviation of a vector as follows −

julia> std(B)
1.5811388300841898

Calculating weighted standard deviation

We can calculate the weighted standard deviation of a vector with respect to a weight vector as follows −

julia> std(B,a)
1.4373989364401725

Calculating mean and standard deviation

We can calculate the mean and standard deviation in a single command as follows −

julia> mean_and_std(B,a)
(2.5454545454545454, 1.4373989364401725)

Calculating mean and variance

We can calculate the mean and variance in a single command as follows −

julia> mean_and_var(B,a)
(2.5454545454545454, 2.066115702479339)
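The weighted results shown above can be reproduced by hand. The following sketch uses plain Julia arrays and the uncorrected (population) form of the weighted variance, which is the form that matches the printed values; the variable names are placeholders, and this is not necessarily how StatsBase implements the computation.

```julia
B = [1., 2., 3., 4., 5.]
w = [4., 2., 1., 3., 1.]   # same weights as the vector a above

# Weighted mean: each value counts in proportion to its weight.
wmean = sum(w .* B) / sum(w)

# Uncorrected weighted variance: weighted average of squared deviations.
wvar = sum(w .* (B .- wmean) .^ 2) / sum(w)

# Weighted standard deviation: square root of the weighted variance.
wstd = sqrt(wvar)

println((wmean, wvar, wstd))
```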

Samples and Estimations

Sampling may be defined as the part of statistics in which, for analysis, sample units are selected from a large population set.

Following are the ways in which we can do sampling. Because the draws are random, the outputs shown in the examples will differ from run to run −

Taking a random sample is the simplest way of sampling. Here we draw one random element from the array, i.e., the population set. The function for this purpose is sample().

Example

julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
  8.0
 12.0
 23.0
 54.5

julia> sample(A)
12.0

Next, we can take n elements as random samples.

Example

julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
  8.0
 12.0
 23.0
 54.5

julia> sample(A, 2)
2-element Array{Float64,1}:
 23.0
 54.5

We can also write the sampled elements to a pre-allocated array of length n. The function to do this task is sample!(). In sample!(B, X), the elements drawn from B overwrite the contents of X.

Example

julia> B = [1., 2., 3., 4., 5.];

julia> X = [2., 1., 3., 2., 5.];

julia> sample!(B,X)
5-element Array{Float64,1}:
 2.0
 2.0
 4.0
 1.0
 3.0

Another way is direct sampling, which randomly picks numbers from a population set and stores them in another array. The function to do this task is direct_sample!().

Example

julia> StatsBase.direct_sample!(B, X)
5-element Array{Float64,1}:
 1.0
 4.0
 4.0
 4.0
 5.0

Knuth's algorithm is another way of doing random sampling, in this case without replacement. The function to do this task is knuths_sample!().

Example

julia> StatsBase.knuths_sample!(B, X)
5-element Array{Float64,1}:
 5.0
 3.0
 4.0
 2.0
 1.0
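The difference between sampling with and without replacement can be made explicit through the replace keyword of sample(). The sketch below assumes StatsBase is installed; it passes a seeded random number generator so that repeated runs draw the same values, and the seed 42 and the variable names are arbitrary choices for illustration.

```julia
using Random, StatsBase

rng = MersenneTwister(42)      # fixed seed, chosen arbitrarily
A = [8., 12., 23., 54.5]

single = sample(rng, A)                     # one element of A
with_repl = sample(rng, A, 6)               # default: duplicates are possible
without_repl = sample(rng, A, 3; replace = false)   # all picks are distinct

out = zeros(2)
sample!(rng, A, out)                        # fills the pre-allocated array

println(single)
println(with_repl)
println(without_repl)
println(out)
```

Sampling without replacement cannot draw more elements than the population contains, which is why with_repl may have six elements while without_repl is limited to at most four here.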