Daru Use Case: Analyzing baby names over a period of time.

In this example we'll read data about prevalance of baby names in the US from a bunch of CSV files and try to analyze to data to figure out interesting trends from it.

In [1]:
require 'daru'
require 'gnuplotrb'
Out[1]:
true

The data is contained in multiple CSV files on a per year basis.

Here's what a raw CSV file looks like for the year 1951. The first column is the name, second the sex and the third the number of births that took place with that name.

Find the data here

In [2]:
CSV.read("data/yob1951.txt").first(10)
Out[2]:
[["Linda", "F", "73933"], ["Mary", "F", "65689"], ["Patricia", "F", "56422"], ["Deborah", "F", "42043"], ["Barbara", "F", "40588"], ["Susan", "F", "40207"], ["Nancy", "F", "30335"], ["Karen", "F", "27986"], ["Sandra", "F", "27656"], ["Kathleen", "F", "26703"]]

All the CSV files are loaded into a DataFrame, a new column 'year' added to them for identification, and then concatenated to produce one large DataFrame.

In [3]:
data_frame = Daru::DataFrame.from_csv("data/yob1951.txt", 
  headers: ['name', 'sex', 'births'])
data_frame['year'] = [1951] * data_frame.size

(1952..2014).each do |year|
  temp = Daru::DataFrame.from_csv("data/yob#{year}.txt", 
    headers: ['name', 'sex', 'births'])
  temp['year'] = [year] * temp.size
  data_frame = data_frame.concat(temp)
end
data_frame.vectors = Daru::Index.new(['births','name', 'sex','year'])

data_frame
Out[3]:
Daru::DataFrame:110707160 rows: 1353205 cols: 4
birthsnamesexyear
073933LindaF1951
165689MaryF1951
256422PatriciaF1951
342043DeborahF1951
440588BarbaraF1951
540207SusanF1951
630335NancyF1951
727986KarenF1951
827656SandraF1951
926703KathleenF1951
1024773CarolF1951
1124070DonnaF1951
1223927SharonF1951
1321700BrendaF1951
1419835DianeF1951
1518559PamelaF1951
1617683MargaretF1951
1717061DebraF1951
1816949JanetF1951
1916289CynthiaF1951
2015947JaniceF1951
2115187CarolynF1951
2215103ElizabethF1951
2315016ChristineF1951
2414929JudithF1951
2514005JudyF1951
2613909ShirleyF1951
2713468JoyceF1951
2812818BettyF1951
2911987CherylF1951
3010947GloriaF1951
3110930RebeccaF1951
...............
13532045ZyrinM2014

Pivot the DataFrame on the year as the row and sex as the column, using the 'births' column for aggregation.

This tells us the number of male and female births per year.

In [4]:
pivoted = data_frame.pivot_table(
  index: ['year'], vectors: ['sex'], agg: :sum, values: 'births')
Out[4]:
Daru::DataFrame:132293920 rows: 64 cols: 2
["births", "F"]["births", "M"]
[1951]18000421881080
[1952]18546981944277
[1953]18803261969777
[1954]19416822037374
[1955]19546642057918
[1956]20075122113694
[1957]20441602155866
[1958]20108842120712
[1959]20230442133509
[1960]20220932132717
[1961]20173162122502
[1962]19665482068945
[1963]19272172031755
[1964]18945941993270
[1965]17650011861378
[1966]16918681783964
[1967]16507641744527
[1968]16401031738928
[1969]16869471789732
[1970]17481471859594
[1971]16634751769201
[1972]15211851622666
[1973]14581391559338
[1974]14674131573105
[1975]14576991562207
[1976]14650961569904
[1977]15329971643684
[1978]15316581642250
[1979]16050511721947
[1980]16599331783876
[1981]16674651790907
[1982]16926781813970
.........
[2014]17687751901376

The row and column names are rather inconvienient so we rename them into something better.

In [5]:
pivoted.index = Daru::DateTimeIndex.date_range(:start => '1951', :periods => pivoted.size, freq: 'YEAR')
pivoted.vectors = Daru::Index.new(['F', 'M'])
pivoted
Out[5]:
Daru::DataFrame:132293920 rows: 64 cols: 2
FM
1951-01-01T00:00:00+00:0018000421881080
1952-01-01T00:00:00+00:0018546981944277
1953-01-01T00:00:00+00:0018803261969777
1954-01-01T00:00:00+00:0019416822037374
1955-01-01T00:00:00+00:0019546642057918
1956-01-01T00:00:00+00:0020075122113694
1957-01-01T00:00:00+00:0020441602155866
1958-01-01T00:00:00+00:0020108842120712
1959-01-01T00:00:00+00:0020230442133509
1960-01-01T00:00:00+00:0020220932132717
1961-01-01T00:00:00+00:0020173162122502
1962-01-01T00:00:00+00:0019665482068945
1963-01-01T00:00:00+00:0019272172031755
1964-01-01T00:00:00+00:0018945941993270
1965-01-01T00:00:00+00:0017650011861378
1966-01-01T00:00:00+00:0016918681783964
1967-01-01T00:00:00+00:0016507641744527
1968-01-01T00:00:00+00:0016401031738928
1969-01-01T00:00:00+00:0016869471789732
1970-01-01T00:00:00+00:0017481471859594
1971-01-01T00:00:00+00:0016634751769201
1972-01-01T00:00:00+00:0015211851622666
1973-01-01T00:00:00+00:0014581391559338
1974-01-01T00:00:00+00:0014674131573105
1975-01-01T00:00:00+00:0014576991562207
1976-01-01T00:00:00+00:0014650961569904
1977-01-01T00:00:00+00:0015329971643684
1978-01-01T00:00:00+00:0015316581642250
1979-01-01T00:00:00+00:0016050511721947
1980-01-01T00:00:00+00:0016599331783876
1981-01-01T00:00:00+00:0016674651790907
1982-01-01T00:00:00+00:0016926781813970
.........
2014-01-01T00:00:00+00:0017687751901376

The number of male births vs. female births can then be plotted against each other using the GnuplotRB gem.

In [6]:
GnuplotRB::Plot.new(
  [pivoted['F'], with: 'lines', title: 'F'],
  [pivoted['M'], with: 'lines', title: 'M'], title: 'Total births by sex and year')
Out[6]:
Gnuplot Produced by GNUPLOT 5.0 patchlevel 3 1.4x106 1.5x106 1.6x106 1.7x106 1.8x106 1.9x106 2x106 2.1x106 2.2x106 01 Jan 1946 01 Jan 1952 01 Jan 1958 01 Jan 1964 01 Jan 1970 01 Jan 1976 01 Jan 1982 01 Jan 1988 01 Jan 1994 01 Jan 2000 01 Jan 2006 01 Jan 2012 01 Jan 2018 Total births by sex and year F F M M

To further prod into the data, lets perform an SQL style GROUP BY operation on the DataFrame on the 'year' column so that the DataFrame is divided into groups according to year.

In [7]:
groups_by_year = data_frame.group_by(['year'])
nil

This code interates over all the groups created by year and selects the rows that contain a particular name in them from.

In this manner we collect rows from every group on a per year basis and create a DataFrame for each name which tells the number of births of a name during a particular year.

For this example we'll choose the names James, Robert, Jessica and Sophia.

In [8]:
pieces = []
['James', 'Robert', 'Jessica', 'Sophia'].each do |name|
  rows = []
  groups_by_year.each_group do |group|
    rows << group.row[group['name'].index_of(name)]
  end
  pieces << Daru::DataFrame.rows(rows)
end
Out[8]:
["James", "Robert", "Jessica", "Sophia"]

Perform some basic preprocessing/cleaning on the DataFrame.

In [9]:
pieces.each do |df|
  df['year'].map! {|e| DateTime.new(e) }
  df.set_index('year')
  df.rename df['name'][0]
  df.delete_vector 'name'
end
Out[9]:
[
#<Daru::DataFrame:257429020 @name = James @size = 64>
               births        sex 
1951-01-01        259          F 
1952-01-01        261          F 
1953-01-01        237          F 
1954-01-01        226          F 
1955-01-01        246          F 
1956-01-01        249          F 
1957-01-01        281          F 
1958-01-01        252          F 
1959-01-01        296          F 
1960-01-01        288          F 
1961-01-01        286          F 
1962-01-01        262          F 
1963-01-01        311          F 
1964-01-01        317          F 
1965-01-01        303          F 
       ...        ...        ... 
, 
#<Daru::DataFrame:83461800 @name = Robert @size = 64>
               births        sex 
1951-01-01        208          F 
1952-01-01        195          F 
1953-01-01        228          F 
1954-01-01        191          F 
1955-01-01        235          F 
1956-01-01        214          F 
1957-01-01        255          F 
1958-01-01        235          F 
1959-01-01        237          F 
1960-01-01        271          F 
1961-01-01        289          F 
1962-01-01        229          F 
1963-01-01        256          F 
1964-01-01        278          F 
1965-01-01        255          F 
       ...        ...        ... 
, 
#<Daru::DataFrame:350000980 @name = Jessica @size = 64>
               births        sex 
1951-01-01        466          F 
1952-01-01        451          F 
1953-01-01        495          F 
1954-01-01        423          F 
1955-01-01        386          F 
1956-01-01        406          F 
1957-01-01        476          F 
1958-01-01        529          F 
1959-01-01        523          F 
1960-01-01        559          F 
1961-01-01        669          F 
1962-01-01        867          F 
1963-01-01       1120          F 
1964-01-01       1172          F 
1965-01-01       1530          F 
       ...        ...        ... 
, 
#<Daru::DataFrame:128568600 @name = Sophia @size = 64>
               births        sex 
1951-01-01        153          F 
1952-01-01        111          F 
1953-01-01        131          F 
1954-01-01        112          F 
1955-01-01        152          F 
1956-01-01        121          F 
1957-01-01        187          F 
1958-01-01        227          F 
1959-01-01        275          F 
1960-01-01        262          F 
1961-01-01        324          F 
1962-01-01        485          F 
1963-01-01        523          F 
1964-01-01        470          F 
1965-01-01        507          F 
       ...        ...        ... 
]
In [10]:
plots = []
pieces.each do |df|
  plot = GnuplotRB::Plot.new([
    df['births'], with: 'lines', title: df.name])
  plot.format_x = '%Y'
  plots << plot
end

GnuplotRB::Multiplot.new(*plots).tap do |mp|
  mp.layout = [2,2]
  mp.format_x = '%Y'
  mp.xtics = 'nomirror rotate by -45'
  mp.title = 'Prevalence of certain names according to year'
  mp.xlabel = 'Year'
  mp.ylabel = 'Occurences'
  mp.xrange = '"1945-01-01":"2016-01-01"'
end
Out[10]:
Gnuplot Produced by GNUPLOT 5.0 patchlevel 3 Prevalence of certain names according to year 0 50 100 150 200 250 300 350 400 1946 1952 1958 1964 1970 1976 1982 1988 1994 2000 2006 2012 2018 Occurences Year James James 0 50 100 150 200 250 300 350 1946 1952 1958 1964 1970 1976 1982 1988 1994 2000 2006 2012 2018 Occurences Year Robert Robert 0 10000 20000 30000 40000 50000 60000 1946 1952 1958 1964 1970 1976 1982 1988 1994 2000 2006 2012 2018 Occurences Year Jessica Jessica 0 5000 10000 15000 20000 25000 1946 1952 1958 1964 1970 1976 1982 1988 1994 2000 2006 2012 2018 Occurences Year Sophia Sophia