{"id":78076,"date":"2021-11-30T22:22:17","date_gmt":"2021-11-30T22:22:17","guid":{"rendered":"https:\/\/papersspot.com\/blog\/2021\/11\/30\/setup-visit-the-world-bank-databanks-website-https-databank-worldbank-org-source-jobs-preview-on-in-the-top-of\/"},"modified":"2021-11-30T22:22:17","modified_gmt":"2021-11-30T22:22:17","slug":"setup-visit-the-world-bank-databanks-website-https-databank-worldbank-org-source-jobs-preview-on-in-the-top-of","status":"publish","type":"post","link":"https:\/\/papersspot.com\/blog\/2021\/11\/30\/setup-visit-the-world-bank-databanks-website-https-databank-worldbank-org-source-jobs-preview-on-in-the-top-of\/","title":{"rendered":"Setup: Visit the World Bank databank\u2019s website: https:\/\/databank.worldbank.org\/source\/jobs\/preview\/on In the top of"},"content":{"rendered":"<p>Setup:<br \/> Visit the World Bank databank\u2019s website:<br \/> https:\/\/databank.worldbank.org\/source\/jobs\/preview\/on<\/p>\n<p> In the top of the panel on the right, click on \u201cAdd Country\u201d, \u201cAdd Series\u201d and \u201cAdd Time\u201d and, for each of those, click \u201cSelect All\u201d, exit out of that window, and click on \u201cApply Changes\u201d. You should have 242 countries, 166 economic series and 27 years.<\/p>\n<p> Programmatically or manually, download the table as a CSV file. Additionally, download the metadata for Country, Series, Country-Series and Series-Time.<\/p>\n<p> Questions:<\/p>\n<p> Using PySpark, please answer the following questions about the data set (a Python function for each question that returns the answer would be best). Note: we would like to see a Python or Scala Spark API for at least 2 out of the first 3 questions (i.e., limit your use of the Spark SQL interface to at most one question).<\/p>\n<p> For each region (e.g., Sub-Saharan Africa, Europe and Central Asia, etc.), which country has the highest \u201cEmployers, female (% of female employment) (modeled ILO estimate)\u201d in 2010? 
The output (either to the console or to a file) should include the region, country, and value, as shown below:

   Europe, Germany, x%
   Central Asia, Kazakhstan, y%
   …

2. For each region, compute the weighted average percentage of “Employers, female (% of female employment) (modeled ILO estimate)”; for a region’s weighted average, the value of the metric for each country in the region is weighted by the size of its population. The output (either to the console or to a file) should include the region and value, as shown below:

   Europe, x%
   Central Asia, y%
   …

3. For every country in Europe, for every year between 1999 and 2015, provide the total number of flight departures (“Air transport, registered carrier departures worldwide”) over the previous 5 years, inclusive of the reporting year (i.e., for a given country in 2014, include the sum of all flights in 2014, 2013, 2012, 2011 and 2010). For instance:

   Germany, 2015, X1
   Germany, 2014, X2
   …
   Germany, 1999, X17
   France, 2015, Y1
   France, 2014, Y2
   …
   France, 1999, Y17
   …

4. Let us define a “knowledge date” as the date on which a particular value became available to be queried from the system. For this dataset, assume each annual data series is assembled by the 7th day of the following year (i.e., the knowledge date for any 2003 value in our dataset is 01/07/2004). However, also assume any data point could be revised or corrected at any time, effectively resulting in multiple knowledge dates (one for each value) for the same data date for the corrected item.
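As a point of reference, the as-of lookup this question is driving at can be sketched in plain Python before porting it to a Spark table design. The record layout and function name below are illustrative assumptions, not part of the assignment; the sample values are the Czech Republic 2003 corrections described in this question.

```python
from datetime import date

# Bi-temporal records for one (country, series, data year):
# each row is (knowledge_date, value). Sample data: the Czech Republic
# 2003 "Air transport" values and their two simulated corrections.
records = [
    (date(2004, 1, 7), 52127),   # original value, assembled Jan 7, 2004
    (date(2004, 6, 15), 64921),  # first correction
    (date(2004, 7, 11), 56284),  # second correction
]

def value_as_of(records, as_of):
    """Return the value with the latest knowledge date <= as_of."""
    visible = [(kd, v) for kd, v in records if kd <= as_of]
    if not visible:
        return None  # nothing was known yet on that date
    return max(visible)[1]  # tuples compare by knowledge date first

print(value_as_of(records, date(2004, 3, 30)))  # 52127
print(value_as_of(records, date(2004, 6, 30)))  # 64921
print(value_as_of(records, date(2004, 7, 30)))  # 56284
```

In a Spark table the same idea becomes an extra `knowledge_date` column, with the as-of query filtering on `knowledge_date <= as_of` and keeping the latest surviving row per key.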
Make the necessary changes to the table to handle this bi-temporality, and introduce two simulated corrections to “Air transport, registered carrier departures worldwide” for the Czech Republic in 2003:

   - Change the original value of 52127 to 64921, with an effective correction date of 06/15/2004.
   - Change it again (a second correction) to 56284, with an effective correction date of 07/11/2004.

As part of this new table design, how would you write a generic query that returns the results for a given as-of date (i.e., the same results you would have obtained if you walked back in time to that as-of date and ran the query against the most recent data available then)? As an example, the output below shows the value for different as-of dates:

   As-of date    Value
   03/30/2004    52127
   06/30/2004    64921
   07/30/2004    56284

5. We want you to think about how you would design a system in AWS to deal with time-series data as in the previous questions (including the addition of the “knowledge date”), but at a much larger scale. For “Air transport, registered carrier departures worldwide”, imagine that the data arrives every hour instead of every year; other data streams could arrive every second or every minute. Please write a high-level architecture document (2 pages, plus a diagram) on how you would bring the data into AWS, transform it, store it and provide an efficient means for people to query that data.

   - What AWS services and other third-party frameworks would you use?
   - How would you bring in the data? How do you ensure data quality? How do you deal with incremental updates?
   - How do you store the data to achieve both flexibility and performance? Storage technologies, formats and layout?
   - Partitions and keys?
   - How would you provide the ability for downstream users to do queries and processing efficiently, especially with massive parallelism? What are the most important considerations for performance?

We do not put a lot of priority on how the write-up looks, as long as the thinking is clearly presented and we can have a productive discussion after reading it (i.e., do not spend a lot of time on formatting). A suggestion is to use the AWS architecture icons (https://aws.amazon.com/architecture/icons/) to quickly put together a diagram.
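For the storage-layout and partitioning questions, one common pattern worth weighing in the write-up is Hive-style `key=value` partitioning of columnar files on S3, which lets engines such as Athena or Spark prune by series and time instead of scanning everything. The sketch below is a hypothetical illustration only; the bucket name, prefix, and series identifier are invented for the example and are not taken from the assignment.

```python
from datetime import datetime

def partition_key(series, ts):
    """Build a hypothetical Hive-style S3 prefix for an hourly data point.

    Partitioning by series, then date, then hour keeps each partition
    small and lets query engines prune on the predicate columns users
    filter by most often.
    """
    return (
        f"s3://example-bucket/timeseries/"
        f"series={series}/date={ts:%Y-%m-%d}/hour={ts:%H}/"
    )

print(partition_key("air_departures", datetime(2014, 6, 15, 13)))
# s3://example-bucket/timeseries/series=air_departures/date=2014-06-15/hour=13/
```

Per-second or per-minute streams would likely warrant coarser partitions (e.g., stop at `date=`) to avoid a small-files problem, with compaction jobs merging late-arriving corrections under their knowledge dates.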