Setup:
Visit the World Bank databank’s website:
https://databank.worldbank.org/source/jobs/preview/on
At the top of the panel on the right, click “Add Country”, “Add Series”, and “Add Time”; for each, click “Select All”, close the window, and click “Apply Changes”. You should end up with 242 countries, 166 economic series, and 27 years.
Download the table as a CSV file, either programmatically or manually. Additionally, download the metadata for Country, Series, Country-Series, and Series-Time.
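For reference, a minimal PySpark loading sketch is shown below; the file names are hypothetical. The sketches accompanying the questions further assume this wide export has been unpivoted to a long-form DataFrame with columns region, country, series, year, and value (the region coming from the Country metadata join).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("worldbank-jobs").getOrCreate()

# Hypothetical file names; the databank export is assumed to be wide, with
# one row per (country, series) pair and one column per year, e.g. "2010 [YR2010]".
data = spark.read.csv("jobs_data.csv", header=True, inferSchema=True)
countries = spark.read.csv("country_metadata.csv", header=True)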
Questions:
Using PySpark, please answer the following questions about the data set (a Python function for each question that returns the answer would be best). Note: we would like to see the Python or Scala Spark API used for at least two of the first three questions (i.e., limit your use of the Spark SQL interface to at most one question).
For each region (e.g., Sub-Saharan Africa, Europe and Central Asia, etc.), which country has the highest “Employers, female (% of female employment) (modeled ILO estimate)” in 2010? The output (either to the console or to a file) should include the region, country, and value (as shown below).
Europe, Germany, x%
Central Asia, Kazakhstan, y%
…
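One possible PySpark approach is sketched below; it is not the definitive solution, and it assumes the long-form layout described in the Setup (columns region, country, series, year, value).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def top_country_per_region(df):
    """Highest female-employer share per region in 2010 (assumed long-form df)."""
    series = "Employers, female (% of female employment) (modeled ILO estimate)"
    w = Window.partitionBy("region").orderBy(F.col("value").desc())
    return (df.filter((F.col("series") == series) & (F.col("year") == 2010))
              .withColumn("rank", F.row_number().over(w))
              .filter(F.col("rank") == 1)
              .select("region", "country", "value"))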
For each region, compute the weighted average percentage of “Employers, female (% of female employment) (modeled ILO estimate)”, weighting each country’s value by the size of its population. The output (either to the console or to a file) should include the region and value (as shown below).
Europe, x%
Central Asia, y%
…
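A sketch of the population-weighted average under the same long-form assumption; the population series name (“Population, total”) and the default year are assumptions, since the question pins down neither.

from pyspark.sql import functions as F

def weighted_avg_per_region(df, year=2010):
    """Population-weighted regional average (year and series names assumed)."""
    metric = "Employers, female (% of female employment) (modeled ILO estimate)"
    emp = (df.filter((F.col("series") == metric) & (F.col("year") == year))
             .select("region", "country", F.col("value").alias("pct")))
    pop = (df.filter((F.col("series") == "Population, total") & (F.col("year") == year))
             .select("country", F.col("value").alias("pop")))
    return (emp.join(pop, "country")
               .groupBy("region")
               .agg((F.sum(F.col("pct") * F.col("pop")) / F.sum("pop"))
                    .alias("weighted_avg")))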
For every country in Europe, for every year between 1999 and 2015, provide the total number of flight departures (“Air transport, registered carrier departures worldwide”) over the previous 5 years, inclusive of the reporting year (i.e., for a given country in 2014, sum the flights for 2014, 2013, 2012, 2011, and 2010). For instance:
Germany, 2015, X1
Germany, 2014, X2
…
Germany, 1999, X17
France, 2015, Y1
France, 2014, Y2
…
France, 1999, Y17
…
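A possible rolling-sum sketch using a range window over the year column. Mapping “Europe” to the databank’s “Europe & Central Asia” region is an assumption; you may resolve “Europe” differently.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def departures_trailing_5yr(df):
    """Trailing 5-year departure totals, 1999-2015 (region mapping assumed)."""
    series = "Air transport, registered carrier departures worldwide"
    # Range window over the numeric year column: the current year plus the
    # four preceding years, even if some intermediate years lack rows.
    w = Window.partitionBy("country").orderBy("year").rangeBetween(-4, 0)
    return (df.filter((F.col("series") == series) &
                      (F.col("region") == "Europe & Central Asia"))
              .withColumn("five_year_total", F.sum("value").over(w))
              .filter(F.col("year").between(1999, 2015))
              .select("country", "year", "five_year_total")
              .orderBy("country", F.col("year").desc()))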
Let us define a ‘Knowledge date’ as the date on which a particular value became available to be queried from the system. For this dataset, assume each annual data series is assembled by the 7th day of the following year (i.e., the knowledge date for any 2003 value in our dataset is 01/07/2004). However, also assume any data point could be revised or corrected at any time, effectively resulting in multiple knowledge dates (one per value) for the same data date of the corrected item.
Make the necessary changes to the table to handle this bi-temporality, and introduce two simulated corrections to “Air transport, registered carrier departures worldwide” in the Czech Republic for 2003:
Changing it from the original value of 52127 to 64921, with an effective correction date of 06/15/2004.
Changing it again (a second correction) to 56284, with an effective correction date of 07/11/2004.
As part of this new table design, how would you write a generic query to obtain the same results for a given as-of date (i.e., the query should return the same results you would have obtained had you walked back in time to that as-of date and run it with the most recent data available then)? As an example, the output below shows the value for different as-of dates.
As-of date    Value
03/30/2004    52127
06/30/2004    64921
07/30/2004    56284
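One way to handle this, sketched below, is to append a knowledge_date column to each fact row (original loads get the 7 January compilation date; corrections get their correction dates as extra rows) and then select, per (country, series, year), the row with the latest knowledge date not exceeding the as-of date.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def as_of(facts, as_of_date):
    """Latest-known value per (country, series, year) as of as_of_date.

    Assumes each row of `facts` carries a knowledge_date column: the
    compilation date for original loads, the correction date for revisions.
    """
    w = (Window.partitionBy("country", "series", "year")
               .orderBy(F.col("knowledge_date").desc()))
    return (facts.filter(F.col("knowledge_date") <= F.to_date(F.lit(as_of_date)))
                 .withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))

With the two simulated corrections inserted as extra rows, as_of(facts, "2004-06-30") would surface 64921 for the Czech 2003 departures, matching the table above.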
We want you to think about how you would design a system in AWS to deal with time-series data as in the previous questions (including the addition of the “knowledge date”), but at a much larger scale. For “Air transport, registered carrier departures worldwide”, imagine that the data arrives every hour instead of every year; other data streams could be per second or per minute. Please write a high-level architecture document (2 pages, plus diagram) on how you would bring the data into AWS, transform it, store it, and provide an efficient means for people to query that data.
What AWS services and other third-party frameworks would you use?
How would you bring in the data? How would you ensure data quality? How would you deal with incremental updates?
How do you store the data to achieve both flexibility and performance? Storage technologies, formats and layout? Partitions and keys?
How would you provide the ability for downstream users to do queries and processing efficiently, especially for massive parallelism? What are the most important considerations for performance?
We do not place a lot of priority on how the write-up looks, as long as the thinking is clearly presented and we can have a productive discussion after reading it (i.e., do not spend a lot of time on formatting). One suggestion is to use the AWS architecture icons (https://aws.amazon.com/architecture/icons/) to put a diagram together quickly.