{"id":4220,"date":"2024-11-05T12:06:29","date_gmt":"2024-11-05T11:06:29","guid":{"rendered":"https:\/\/aubertm.me\/?p=4220"},"modified":"2025-02-24T12:22:32","modified_gmt":"2025-02-24T11:22:32","slug":"etl-comparison-in-ms-azure-aws-and-gcp","status":"publish","type":"post","link":"https:\/\/aubertm.me\/index.php\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/","title":{"rendered":"ETL comparison in MS Azure, AWS and GCP"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In order to compare how technologies handle data, I set up an ETL funnel in each environement, extracting data from Wikipedia and transforming it into a PARQUET file.<\/p>\n\n\n\n<!--more-->\n\n\n\n<ol><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#from-wikipedia\">From Wikipedia<\/a><ol><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#first-step-build-the-sparql-query\">First step: build the SPARQL query<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#second-step-enrich-the-sparql-query-with-aggregation\">Second step: enrich the SPARQL query with aggregation<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#third-step-generate-a-data-feed\">Third step: generate a data feed<\/a><\/li><\/ol><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#to-ms-azure-data-factory\">To MS Azure Data Factory<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#to-aws-glue\">To AWS Glue<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#to-gcp\">To GCP<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/aubertm.me\/2024\/11\/05\/etl-comparison-in-ms-azure-aws-and-gcp\/#to-denodo\">My conclusion<\/a><\/li><\/ol>\n\n\n<h2 class=\"wp-block-heading\" id=\"from-wikipedia\">From Wikipedia<\/h2>\n\n\n<p class=\"wp-block-paragraph\">Data description: women&#8217;s profiles that are less tagged, less connected to groups, and therefore quite lost within the information system. I limited my target to the French-written part of Wikipedia<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What I learned (the hard way) on Wikipedia: <a href=\"https:\/\/www.mediawiki.org\/wiki\/API:Main_page\" target=\"_blank\" rel=\"noreferrer noopener\">APIs connect to articles&#8217; management metadata<\/a> and <a href=\"https:\/\/www.wikidata.org\/wiki\/Wikidata:SPARQL_query_service\" target=\"_blank\" rel=\"noreferrer noopener\">SPARQL connect to data and categories<\/a><\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"first-step-build-the-sparql-query\">First step: build the SPARQL query<\/h3>\n\n\n<p class=\"wp-block-paragraph\">I won&#8217;t give a course on Wikidata&#8217;s architecture: the <a href=\"https:\/\/query.wikidata.org\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">Wikidata Query Service<\/a> offers a Query Helper to help finding the parameters for building the right query.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/aubertm.me\/wp-content\/uploads\/2024\/10\/image.png?w=800\" alt=\"\" class=\"wp-image-4270\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">I combined :<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>parameter P21 sex or gender: category Q6581072 female<\/li>\n\n\n\n<li>parameter P1412 languages spoken, written or signed: category Q150 French<\/li>\n\n\n\n<li>parameter of wikibase:language &#8220;fr&#8221;<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">In order to check that the query runs, it&#8217;s better to first count results.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT (COUNT (?item) as ?count)\nWHERE {\n  ?item wdt:P21 wd:Q6581072;\n    wdt:P1412 wd:Q150. \n  SERVICE wikibase:label { bd:serviceParam wikibase:language \"fr\".}  \n}<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Result: 70349 records (<a href=\"https:\/\/w.wiki\/BQTq\" target=\"_blank\" rel=\"noreferrer noopener\">view live<\/a>)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Then the query can be checked with data vizualisation: let&#8217;s limit results to 100.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT ?item ?itemLabel\nWHERE {\n  ?item wdt:P21 wd:Q6581072;\n    wdt:P1412 wd:Q150. \n  SERVICE wikibase:label { bd:serviceParam wikibase:language \"fr\".}  \n}\nLIMIT 100<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Result: 100 results in 3422&nbsp;ms (<a href=\"https:\/\/w.wiki\/BQU3\" target=\"_blank\" rel=\"noreferrer noopener\">view live<\/a>)<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"second-step-enrich-the-sparql-query-with-aggregation\">Second step: enrich the SPARQL query with aggregation<\/h3>\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s then aggregate the fields &#8220;has part(s)&#8221; and &#8220;part of&#8221;:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT ?item ?itemLabel\n(COUNT (?has_part_s_) as ?count_has_part_s_)\n(COUNT (?part_of) as ?count_part_of)\nWHERE {\n  ?item wdt:P21 wd:Q6581072;\n    wdt:P1412 wd:Q150.\n  SERVICE wikibase:label { bd:serviceParam wikibase:language \"fr\".} \n  OPTIONAL { ?item wdt:P527 ?has_part_s_. }\n  OPTIONAL { ?item wdt:P361 ?part_of. }\n}\nGROUP BY ?item ?itemLabel<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Results: 70349 r\u00e9sultats en 43383\u202fms (<a href=\"https:\/\/w.wiki\/BRAg\" target=\"_blank\" rel=\"noreferrer noopener\">view live<\/a>)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">NB: to get <em>?itemLabel<\/em> appearing in the results, it has to be mentionned in GROUP BY.<\/p>\n\n\n<h3 class=\"wp-block-heading\" id=\"third-step-generate-a-data-feed\">Third step: generate a data feed<\/h3>\n\n\n<p class=\"wp-block-paragraph\">The Wikidata SPARQL sandbox offers many different ways of sharing:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sharing the query via: URL,HTML, Wikilink, PHP, JavaScript (jQuery), JavaScript (modern), Java, Perl, Python, Python (Pywikibot), Ruby, R, Matlab, listeria, mapframe<\/li>\n\n\n\n<li>Download results via: JSON file, JSON file (verbose), TSV file, TSV file (verbose), CSV file, HTML table<\/li>\n\n\n\n<li>Get a link to: Short URL to query, Short URL to result, SPARQL Endpoint, Embed result<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We&#8217;ll check with each technologie what&#8217;s the best feed format to use.<\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"to-ms-azure-data-factory\">To MS Azure Data Factory<\/h2>\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/feedback.azure.com\/d365community\/idea\/8e287df8-0e25-ec11-b6e6-000d3a4f0858\" target=\"_blank\" rel=\"noreferrer noopener\">MS Azure does not handle SPARQL<\/a>. It is therefore necessary to export the dataflow in another format: I chose as a JSON feed. The good practices ask to import it into a JSON file in an Azure blob storage before transforming it into another file format. Finally, I chose PARQUET as an an handier end file format.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Create a new pipeline in <a href=\"https:\/\/adf.azure.com\/\">Azure Data Factory<\/a> \n<ul class=\"wp-block-list\">\n<li>To copy the JSON feed into a JSON file<\/li>\n\n\n\n<li>Mention Wikidata time limit: 60 seconds; retry after 120seconds<\/li>\n\n\n\n<li>Import its schema<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Source: create a new datasource with a GET connection to HTTP anonymous URL\n<ul class=\"wp-block-list\">\n<li>as base URL <a href=\"https:\/\/query.wikidata.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/query.wikidata.org\/<\/a>&nbsp;<\/li>\n\n\n\n<li>as relative URL as REST in JSON format and the SPARQL query\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/query.wikidata.org\/sparql?format=json&amp;query=SELECT%20?item%20?itemLabel%20(COUNT%20(?has_part_s_)%20as%20?count_has_part_s_)%20(COUNT%20(?part_of)%20as%20?count_part_of)%20WHERE%20{%20?item%20wdt:P21%20wd:Q6581072;%20wdt:P1412%20wd:Q150.%20SERVICE%20wikibase:label%20{%20bd:serviceParam%20wikibase:language%20%22fr%22.}%20OPTIONAL%20{%20?item%20wdt:P527%20?has_part_s_.%20}%20OPTIONAL%20{%20?item%20wdt:P361%20?part_of.%20}%20}%20GROUP%20BY%20?item%20?itemLabel\">\/sparql?format=json&amp;query=SELECT ?item ?itemLabel (COUNT (?has_part_s_) as ?count_has_part_s_) (COUNT (?part_of) as ?count_part_of) WHERE { ?item wdt:P21 wd:Q6581072; wdt:P1412 wd:Q150. SERVICE wikibase:label { bd:serviceParam wikibase:language &#8220;fr&#8221;.} OPTIONAL { ?item wdt:P527 ?has_part_s_. } OPTIONAL { ?item wdt:P361 ?part_of. } } GROUP BY ?item ?itemLabel<\/a><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>and a header to announce you&#8217;re not a bot; for example:\n<ul class=\"wp-block-list\">\n<li>User-Agent\u202f: AzureLearning\/0.0 (https:\/\/www.aubertm.me marina@aubertm.me)&nbsp;<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li>Destination: create a <a href=\"https:\/\/portal.azure.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">storage account<\/a> then a blob container to store the JSON output dataset.<\/li>\n\n\n\n<li>Map the source fields for the destination fields.<\/li>\n\n\n\n<li><a href=\"https:\/\/stackoverflow.com\/questions\/70341943\/azure-data-factory-copy-task-using-rest-api-is-only-returning-first-row-upon-e\" target=\"_blank\" rel=\"noreferrer noopener\">\/!\\ Important<\/a>: validate all, publish all, then import a second time the output dataset format.<br>Then again : validate all, publish all<br>Validate the pipeline, Validate runtime, Debug<\/li>\n\n\n\n<li>Create a second pipeline in Azure Data Factory\n<ul class=\"wp-block-list\">\n<li>To copy and flatten the JSON file into a PARQUET file<\/li>\n\n\n\n<li>Source: the JSON file you just created<\/li>\n\n\n\n<li>Destination: you can use the same blob container as before.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>again :<br>\/!\\ Important: validate all, publish all, then import a second time the output dataset format.<br>Then again : validate all, publish all<br>Validate the pipeline, Validate runtime, Debug<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">You can visualize data from the datasets.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/aubertm.me\/wp-content\/uploads\/2024\/10\/image-3.png?w=800\" alt=\"\" class=\"wp-image-4339\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/github.com\/aubertmme\/wikipedia_f_fr\/tree\/ms_azure_data_factory\" target=\"_blank\" rel=\"noreferrer noopener\">Code is here<\/a><\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"to-aws-glue\">To AWS Glue<\/h2>\n\n\n<p class=\"wp-block-paragraph\">AWS does not handle SPARQL either. Let&#8217;s keep on using our Wikidata&#8217;s JSON feed.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In <a href=\"https:\/\/aws.amazon.com\/fr\/glue\/\" target=\"_blank\" rel=\"noreferrer noopener\">AWS Glue Studio<\/a>, configure IAM<\/li>\n\n\n\n<li>In <a href=\"https:\/\/aws.amazon.com\/fr\/s3\/\" target=\"_blank\" rel=\"noreferrer noopener\">Amazon&nbsp;S3 storage<\/a>, create a bucket and its endpoint<\/li>\n\n\n\n<li>In AWS Glue Studio,\n<ul class=\"wp-block-list\">\n<li>create a Python notebook to connect the Wikidata&#8217;s JSON feed to the S3 storage space<\/li>\n\n\n\n<li>create a Visual ETL\n<ul class=\"wp-block-list\">\n<li>Source: select the S3 location of the JSON file you just created with the notebook; pick JSON format<br>\/!\\ leave Recursive blank<\/li>\n\n\n\n<li>Transform: Flatten<\/li>\n\n\n\n<li>Target: same S3 location; pick PARQUET format<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Hit RUN<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Vizualise your new PARQUET file in S3 container<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/aubertm.me\/wp-content\/uploads\/2024\/10\/image-2.png?w=800\" alt=\"\" class=\"wp-image-4336\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/github.com\/aubertmme\/wikipedia_f_fr\/tree\/aws_glue_studio\" target=\"_blank\" rel=\"noreferrer noopener\">Code is here<\/a><\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"to-gcp\">To GCP<\/h2>\n\n\n<p class=\"wp-block-paragraph\">Google Cloud Service only accepts TSV files, but then Wikidata refuses the googlebot&#8217;s  request. So I had to go through BigQuery Studio throught a Python request.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In <a href=\"https:\/\/console.cloud.google.com\/storage\/\" target=\"_blank\" rel=\"noreferrer noopener\">Cloud Storage<\/a>, create a bucket<\/li>\n\n\n\n<li>In <a href=\"https:\/\/console.cloud.google.com\/bigquery\" target=\"_blank\" rel=\"noreferrer noopener\">BigQuery Studio<\/a>, \n<ul class=\"wp-block-list\">\n<li>create a Python notebook to transfer the Wikidata&#8217;s JSON feed to your bucket as JSON file<\/li>\n\n\n\n<li>create a dataset and table from the JSON file, with automatic detection of its scheme<\/li>\n\n\n\n<li>export the new table to Google Cloud Storage as a PARQUET file<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Download your files in Google Cloud Storage, and vizualise them locally<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/aubertm.me\/wp-content\/uploads\/2024\/10\/image-4.png?w=800\" alt=\"\" class=\"wp-image-4346\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/github.com\/aubertmme\/wikipedia_f_fr\/tree\/aws_glue_studio\" target=\"_blank\" rel=\"noreferrer noopener\">Code <\/a><a href=\"https:\/\/github.com\/aubertmme\/wikipedia_f_fr\/tree\/gcp_bigquery_studio\" target=\"_blank\" rel=\"noreferrer noopener\">is <\/a><a href=\"https:\/\/github.com\/aubertmme\/wikipedia_f_fr\/tree\/aws_glue_studio\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a><\/p>\n\n\n<h2 class=\"wp-block-heading\" id=\"to-denodo\">My conclusion<\/h2>\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Criteria<\/th><th>MS Azure<\/th><th>AWS Glue<\/th><th>GCP<\/th><\/tr><\/thead><tbody><tr><td>Direct connection to the feed<\/td><td>No: from JSON feed <br>to JSON file<\/td><td>Yes: work directly from the JSON feed<\/td><td>No: from JSON feed to JSON file<\/td><\/tr><tr><td>Data vizualisation<\/td><td>Yes: from input feed to output dataset<\/td><td>Yes: only output dataset<\/td><td>No: have to download output dataset<\/td><\/tr><tr><td>Easy to manipulate<\/td><td>Yes, but details to validate and deploy again and again<\/td><td>Yes, but still with some specific terms to learn<\/td><td>Yes, but a bunch of old tools in the way<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Glue&#8217;s visual interface is much easier to understand than the multiple validations and publications needed in MS Azure Data Factory, or the estonishing non-web native manipulation in GCP.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In order to compare how technologies handle data, I set up an ETL funnel in each environement, extracting data from Wikipedia and transforming it into a PARQUET file.<\/p>\n","protected":false},"author":2,"featured_media":4815,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[3],"tags":[9],"class_list":["post-4220","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-architecture-governance","tag-en"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/aubertm.me\/wp-content\/uploads\/2024\/10\/image-3.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/posts\/4220","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/comments?post=4220"}],"version-history":[{"count":1,"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/posts\/4220\/revisions"}],"predecessor-version":[{"id":4821,"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/posts\/4220\/revisions\/4821"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/media\/4815"}],"wp:attachment":[{"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/media?parent=4220"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/categories?post=4220"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aubertm.me\/index.php\/wp-json\/wp\/v2\/tags?post=4220"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}