
Run SQL statements on Databricks SQL warehouses with the Databricks SQL Statement Execution REST API



Important

This feature is in Public Preview.

Important

To access Databricks REST APIs, you must authenticate.

This tutorial shows how to use the Databricks SQL Statement Execution API 2.0 to run SQL statements from Databricks SQL warehouses.

To view and to download an OpenAPI Specification-compatible version of the Databricks SQL Statement Execution API 2.0 reference, see Statement Execution API 2.0.

Before you begin

Before you start this tutorial, be sure that you have:

The workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net, for your Azure Databricks workspace. This tutorial assumes that you have an environment variable on your local development machine named DATABRICKS_HOST, which is set to this value. To learn how to set environment variables, see your operating system's documentation.

Warning

Databricks strongly discourages hard-coding sensitive information into your scripts, as it can be exposed in plain text through version control systems. Databricks recommends that you use approaches such as environment variables that you set on your development machine instead. Removing such hard-coded information from your scripts also helps to make those scripts more portable.

An Azure Databricks personal access token or Azure Active Directory (Azure AD) token for your Azure Databricks workspace user. This tutorial assumes that you have an environment variable on your local development machine named DATABRICKS_TOKEN, which is set to the value of your token.

Note

As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.

At least one table that you can execute SQL statements against. This tutorial is based on the lineitem table in the tpch schema (also known as a database) within the samples catalog. If you do not have access to this catalog, schema, or table from your workspace, substitute them throughout this tutorial with your own.

curl, a command-line tool for sending and receiving REST API requests and responses. See also Install curl. Alternatively, you can adapt this tutorial's examples for use with similar tools such as Postman or HTTPie.

For each of the curl examples used in this article:

- Instead of --header "Authorization: Bearer ${DATABRICKS_TOKEN}", you can use a .netrc file. If you use a .netrc file, replace --header "Authorization: Bearer ${DATABRICKS_TOKEN}" with --netrc. (An example .netrc entry follows this list.)
- If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, replace \ with ^, and replace ${...} with %...%.
- If you use the Windows Command shell instead of a command shell for Unix, Linux, or macOS, in JSON document declarations, replace the opening and closing ' with ", and replace inner " with \".
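For reference, a minimal .netrc entry for curl's --netrc option might look like the following sketch. The hostname mirrors the example workspace instance name from the prerequisites, and the token value is a placeholder:

machine adb-1234567890123456.7.azuredatabricks.net
login token
password <your-databricks-personal-access-token>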

jq, a command-line processor for querying the JSON response payloads that the Databricks SQL Statement Execution API returns after each call. See also Download jq.

A Databricks SQL warehouse. This tutorial assumes that you have an environment variable on your local development machine named DATABRICKS_SQL_WAREHOUSE_ID, which is the string of letters and numbers following /sql/1.0/warehouses/ in the HTTP path field for your warehouse. To learn how to get your warehouse's HTTP path value, see Get connection details for a SQL warehouse.

The Databricks SQL warehouse must be set to use the Preview channel in Advanced options. Note that switching a warehouse from using the Current channel to the Preview channel will cause the warehouse to automatically restart. This upgrade could take several minutes.
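If you have not yet set the environment variables that this tutorial relies on, you can export them in your shell session before continuing. The values below are placeholders; substitute your own workspace instance name, token, and SQL warehouse ID:

# Placeholder values only; replace them with your own.
export DATABRICKS_HOST=adb-1234567890123456.7.azuredatabricks.net
export DATABRICKS_TOKEN=<your-databricks-token>
export DATABRICKS_SQL_WAREHOUSE_ID=<your-sql-warehouse-id>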

Step 1: Execute a SQL statement and save the data result as JSON

Run the following command, which does the following:

- Uses the specified SQL warehouse, along with the specified token, to query for three columns from the first two rows of the lineitem table in the tpch schema within the samples catalog.
- Saves the response payload in JSON format in a file named sql-execution-response.json within the current working directory.
- Prints the contents of the sql-execution-response.json file.
- Sets a local environment variable named SQL_STATEMENT_ID that contains the ID of the corresponding SQL statement. You can use this SQL statement ID to get information about that statement later as needed, which is demonstrated in Step 2. You can also view this SQL statement and get its statement ID from the query history section of the Databricks SQL console, or by calling the Query History API 2.0.
- Sets an additional local environment variable named NEXT_CHUNK_INTERNAL_LINK that contains an API URL fragment for getting the next chunk of JSON data. If the response data is too large, the Databricks SQL Statement Execution API provides the response in chunks. You can use this API URL fragment to get the next chunk of data, which is demonstrated in Step 2. If there is no next chunk, then this environment variable is set to null.
- Prints the values of the SQL_STATEMENT_ID and NEXT_CHUNK_INTERNAL_LINK environment variables.

curl --request POST \
  https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
    "catalog": "samples",
    "schema": "tpch",
    "statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 2"
  }' \
  --output 'sql-execution-response.json' \
  && jq . 'sql-execution-response.json' \
  && export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
  && export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
  && echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID \
  && echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

In the preceding request:

- By default, any returned data is in JSON array format, and the default location for any of the SQL statement's data results is within the response payload. To make this behavior explicit, add "format":"JSON_ARRAY","disposition":"INLINE" to the request payload.
- If you attempt to return data results larger than 16 MiB in the response payload, a failure status is returned and the SQL statement is canceled. For data results larger than 16 MiB, you can use external links instead of trying to return them in the response payload, which is demonstrated in Step 3.
- curl stores the response payload's contents to a local file. Local data storage is not supported by the Databricks SQL Statement Execution API directly.
- By default, after 10 seconds, if the SQL statement has not yet finished executing through the warehouse, the Databricks SQL Statement Execution API returns only the SQL statement ID and its current status, instead of the statement's result. To change this behavior, add "wait_timeout":"<x>s" to the request payload, where <x> can be between 5 and 50 seconds inclusive, for example "wait_timeout":"50s". To return the SQL statement ID and its current status immediately, set wait_timeout to 0s.
- By default, the SQL statement continues to run if the timeout period is reached. To cancel a SQL statement if the timeout period is reached instead, add "on_wait_timeout":"CANCEL" to the request payload. (A request that makes these options explicit is sketched after this list.)
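For example, the following variation of the Step 1 request makes the default format and disposition explicit and changes the timeout behavior. This is a sketch that assumes the same environment variables and sample table as before; adjust the values as needed:

# Same request as Step 1, with format, disposition, and timeout options made explicit.
curl --request POST \
  https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
    "catalog": "samples",
    "schema": "tpch",
    "format": "JSON_ARRAY",
    "disposition": "INLINE",
    "wait_timeout": "30s",
    "on_wait_timeout": "CANCEL",
    "statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 2"
  }' \
  --output 'sql-execution-response.json' \
  && jq . 'sql-execution-response.json'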

If the statement's result is available before the wait timeout ends, the response is as follows:

{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "JSON_ARRAY",
    "schema": {
      "column_count": 3,
      "columns": [
        { "name": "l_orderkey", "type_name": "LONG", "position": 0 },
        { "name": "l_extendedprice", "type_name": "DECIMAL", "position": 1, "type_precision": 18, "type_scale": 2 },
        { "name": "l_shipdate", "type_name": "DATE", "position": 2 }
      ]
    }
  },
  "result": {
    "chunk_index": 0,
    "row_offset": 0,
    "row_count": 2,
    "data_array": [
      [ "15997987", "66516.00", "1992-02-12" ],
      [ "15997988", "53460.96", "1994-05-31" ]
    ]
  }
}
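Because inline results arrive as JSON arrays of strings in result.data_array, you can post-process the saved response with jq. For example, the following sketch prints the returned rows as CSV lines:

# Print each returned row from the saved response as a CSV line.
jq -r '.result.data_array[] | @csv' sql-execution-response.json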

If the wait timeout ends before the statement's result is available, the response looks like this instead:

{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "PENDING"
  }
}
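To check just the statement's state from the saved response, you can extract the status.state field with jq, for example:

# Print the statement's current state, for example PENDING or SUCCEEDED.
jq -r '.status.state' sql-execution-response.json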

If the statement's result data is too large (for example in this case, by running SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 3000000), the result data is chunked and looks like this instead. Note that "...": "..." indicates omitted results here for brevity:

{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "JSON_ARRAY",
    "schema": {
      "column_count": 3,
      "columns": [
        { "...": "..." }
      ]
    }
  },
  "result": {
    "chunk_index": 0,
    "row_offset": 0,
    "row_count": 432500,
    "next_chunk_index": 1,
    "next_chunk_internal_link": "/api/2.0/sql/statements/01ed92c5-3583-1f38-b21b-c6773e7c56b3/result/chunks/1?row_offset=432500",
    "data_array": [
      [ "15997987", "66516.00", "1992-02-12" ],
      [ "..." ]
    ]
  }
}

Step 2: Get a statement's current execution status and data result as JSON

You can use a SQL statement's ID to get that statement's current execution status and, if the execution succeeded, that statement's result. If you forget the statement's ID, you can get it from the query history section of the Databricks SQL console, or by calling the Query History API 2.0. For example, you could keep polling this command, checking each time to see if the execution has succeeded.

To get a SQL statement's current execution status and, if the execution succeeded, that statement's result and an API URL fragment for getting any next chunk of JSON data, run the following command. This command assumes that you have an environment variable on your local development machine named SQL_STATEMENT_ID, which is set to the value of the ID of the SQL statement from the previous step or otherwise provided. Of course, you can substitute ${SQL_STATEMENT_ID} in the following command with the hard-coded ID of the SQL statement.

curl --request GET \
  https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --output 'sql-execution-response.json' \
  && jq . 'sql-execution-response.json' \
  && export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .result.next_chunk_internal_link 'sql-execution-response.json') \
  && echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK
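If you want to poll automatically instead of re-running the command by hand, the following is a minimal sketch of a shell loop that repeats the status request until the statement leaves the PENDING and RUNNING states. The five-second interval is an arbitrary choice:

while true; do
  # Fetch the statement's current state.
  STATE=$(curl --silent --request GET \
    https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
    --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    | jq -r '.status.state')
  echo "Current state: ${STATE}"
  # Stop polling once the statement has finished, failed, or been canceled.
  if [ "${STATE}" != "PENDING" ] && [ "${STATE}" != "RUNNING" ]; then
    break
  fi
  sleep 5
done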

If the NEXT_CHUNK_INTERNAL_LINK is set to a non-null value, you can use it to get the next chunk of data, and so on, for example with the following command:

curl --request GET \
  https://${DATABRICKS_HOST}${NEXT_CHUNK_INTERNAL_LINK} \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --output 'sql-execution-response.json' \
  && jq . 'sql-execution-response.json' \
  && export NEXT_CHUNK_INTERNAL_LINK=$(jq -r .next_chunk_internal_link 'sql-execution-response.json') \
  && echo NEXT_CHUNK_INTERNAL_LINK=$NEXT_CHUNK_INTERNAL_LINK

You can keep running the preceding command to fetch each subsequent chunk. Note that as soon as the last chunk is fetched, the SQL statement is closed. After this closure, you cannot use that statement's ID to get its current status or to fetch any more chunks.
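A minimal sketch of automating this chunk fetching follows. It assumes that the first response is already saved in sql-execution-response.json and that each chunk response exposes data_array and next_chunk_internal_link at the top level, as the preceding command suggests; it appends each chunk's rows to a local CSV file:

# Start from the link captured in the previous step (null if there are no more chunks).
NEXT_CHUNK_INTERNAL_LINK=$(jq -r '.result.next_chunk_internal_link // .next_chunk_internal_link' sql-execution-response.json)
while [ "${NEXT_CHUNK_INTERNAL_LINK}" != "null" ] && [ -n "${NEXT_CHUNK_INTERNAL_LINK}" ]; do
  # Fetch the next chunk of result data.
  curl --silent --request GET \
    https://${DATABRICKS_HOST}${NEXT_CHUNK_INTERNAL_LINK} \
    --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    --output 'sql-execution-response.json'
  # Append this chunk's rows to a local file as CSV.
  jq -r '.data_array[] | @csv' sql-execution-response.json >> all-rows.csv
  # Move on to the following chunk, if any.
  NEXT_CHUNK_INTERNAL_LINK=$(jq -r '.next_chunk_internal_link' sql-execution-response.json)
done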

Step 3: Fetch large results using external links

This section demonstrates an optional configuration that uses the EXTERNAL_LINKS disposition to retrieve large data sets. The default location (disposition) for the SQL statement result data is within the response payload, but these results are limited to 16 MiB. By setting the disposition to EXTERNAL_LINKS, the response contains URLs you can use to fetch the chunks of the results data with standard HTTP. The URLs point to your workspace's internal DBFS, where the result chunks are temporarily stored.

Warning

Databricks strongly recommends that you protect the URLs and tokens that are returned by the EXTERNAL_LINKS disposition.

When you use the EXTERNAL_LINKS disposition, a shared access signature (SAS) URL is generated, which can be used to download the results directly from Azure storage. As a short-lived SAS token is embedded within this SAS URL, you should protect both the SAS URL and the SAS token.

Because SAS URLs are already generated with embedded temporary SAS tokens, you must not set an Authorization header in the download requests.

The EXTERNAL_LINKS disposition can be disabled upon request. To make this request, create a support case.

See also Security best practices.

Note

The response payload output format and behavior, once they are set for a particular SQL statement ID, cannot be changed.

In this mode, the API stores result data only in Apache Arrow format, which must be fetched separately with HTTP. Also, when using this mode, it is not possible to inline the result data within the response payload.

The following command demonstrates using EXTERNAL_LINKS. Use this pattern instead of the similar query demonstrated in step 1:

curl --request POST \
  https://${DATABRICKS_HOST}/api/2.0/sql/statements/ \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "warehouse_id": "'"$DATABRICKS_SQL_WAREHOUSE_ID"'",
    "catalog": "samples",
    "schema": "tpch",
    "format": "ARROW_STREAM",
    "disposition": "EXTERNAL_LINKS",
    "statement": "SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem LIMIT 100000"
  }' \
  --output 'sql-execution-response.json' \
  && jq . 'sql-execution-response.json' \
  && export SQL_STATEMENT_ID=$(jq -r .statement_id 'sql-execution-response.json') \
  && echo SQL_STATEMENT_ID=$SQL_STATEMENT_ID

The response is as follows:

{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "ARROW_STREAM",
    "schema": {
      "column_count": 3,
      "columns": [
        { "name": "l_orderkey", "type_name": "LONG", "position": 0 },
        { "name": "l_extendedprice", "type_name": "DECIMAL", "position": 1, "type_precision": 18, "type_scale": 2 },
        { "name": "l_shipdate", "type_name": "DATE", "position": 2 }
      ]
    },
    "total_chunk_count": 1,
    "chunks": [
      { "chunk_index": 0, "row_offset": 0, "row_count": 100000 }
    ],
    "total_row_count": 100000,
    "total_byte_count": 2848312
  },
  "result": {
    "external_links": [
      {
        "chunk_index": 0,
        "row_offset": 0,
        "row_count": 100000,
        "external_link": "",
        "expiration": ""
      }
    ]
  }
}

If the request times out, the response looks like this instead:

{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "PENDING"
  }
}

To get that statement's current execution status and, if the execution succeeded, that statement's result, run the following command:

curl --request GET \
  https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID} \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --output 'sql-execution-response.json' \
  && jq . 'sql-execution-response.json'

If the response is large enough (for example in this case, by running SELECT l_orderkey, l_extendedprice, l_shipdate FROM lineitem with no row limit), the response will have multiple chunks, as in the following example. Note that "...": "..." indicates omitted results here for brevity:

{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "ARROW_STREAM",
    "schema": {
      "column_count": 3,
      "columns": [
        { "...": "..." }
      ]
    },
    "total_chunk_count": 43,
    "chunks": [
      { "chunk_index": 0, "row_offset": 0, "row_count": 737500 },
      { "chunk_index": 1, "row_offset": 737500, "row_count": 737500 },
      { "...": "..." },
      { "chunk_index": 41, "row_offset": 28755596, "row_count": 737500 },
      { "chunk_index": 42, "row_offset": 29493096, "row_count": 506699 }
    ],
    "total_row_count": 29999795,
    "total_byte_count": 854409376
  },
  "result": {
    "external_links": [
      {
        "chunk_index": 0,
        "row_offset": 0,
        "row_count": 737500,
        "next_chunk_index": 1,
        "next_chunk_internal_link": "/api/2.0/sql/statements/01ed92c5-3583-1f38-b21b-c6773e7c56b3/result/chunks/1?row_offset=737500",
        "external_link": "",
        "expiration": ""
      }
    ]
  }
}

To download the stored content's results, you can run the following curl command, using the URL in the external_link object and specifying where you want to download the file. Do not include your Azure Databricks token in this command:

curl "" \
  --output ""
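For example, the following sketch extracts the first chunk's external link from the saved response with jq and then downloads the chunk to a local file; the output file name is arbitrary:

# Extract the first chunk's short-lived SAS URL from the saved response payload.
EXTERNAL_LINK=$(jq -r '.result.external_links[0].external_link' sql-execution-response.json)
# Download the chunk directly from Azure storage. Do not send the Databricks Authorization header here.
curl "${EXTERNAL_LINK}" --output 'chunk-0.arrow-stream'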

To download a specific chunk of a streamed content's results, first use the chunk_index and row_offset values from the response payload. For example, to get the chunk with a chunk_index of 41 and a row_offset of 28755596 from the previous response, run the following command:

curl --request GET \
  https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/result/chunks/41?row_offset=28755596 \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --output 'sql-execution-response.json' \
  && jq . 'sql-execution-response.json'

Note

Running the preceding command returns a new SAS URL.

Arbitrary row_offset values cannot be specified. You can only use row_offset values that are contained within the manifest object.

To download the stored chunk, use the URL in the external_link object.
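Putting these pieces together, the following is a minimal sketch that walks the chunk list in the manifest, requests a fresh external link for each chunk, and downloads every chunk to its own local file. It assumes the full statement response from the previous GET is still saved in sql-execution-response.json, and that each chunk response exposes external_links at the top level, mirroring the result object shown earlier; the file naming is arbitrary:

# Determine how many chunks the result has.
TOTAL_CHUNK_COUNT=$(jq -r '.manifest.total_chunk_count' sql-execution-response.json)
for (( i=0; i<TOTAL_CHUNK_COUNT; i++ )); do
  # Look up this chunk's row offset from the manifest.
  ROW_OFFSET=$(jq -r ".manifest.chunks[${i}].row_offset" sql-execution-response.json)
  # Request a fresh SAS URL for this chunk.
  EXTERNAL_LINK=$(curl --silent --request GET \
    "https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/result/chunks/${i}?row_offset=${ROW_OFFSET}" \
    --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    | jq -r '.external_links[0].external_link')
  # Download the chunk from Azure storage, again without the Databricks Authorization header.
  curl --silent "${EXTERNAL_LINK}" --output "chunk-${i}.arrow-stream"
done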

For more information about the Apache Arrow format, see:

- IPC Streaming Format
- Writing and Reading Streaming Format
- Using streams

Step 4: Cancel a SQL statement's execution

If you need to cancel a SQL statement that has not yet succeeded, run the following command:

curl --request POST \
  https://${DATABRICKS_HOST}/api/2.0/sql/statements/${SQL_STATEMENT_ID}/cancel \
  --header "Authorization: Bearer ${DATABRICKS_TOKEN}"

Security best practices

The Databricks SQL Statement Execution API increases the security of data transfers by using end-to-end transport layer security (TLS) encryption and short-lived credentials such as SAS tokens.

There are several layers in this security model. At the transport layer, it is only possible to communicate with the Databricks SQL Statement Execution API by using TLS 1.2 or above. Also, callers of the Databricks SQL Statement Execution API must be authenticated with a valid Azure Databricks personal access token or Azure Active Directory (Azure AD) token that maps to a user who has the entitlement to use Databricks SQL. This user must have Can Use access for the specific SQL warehouse that is being used, and access can be restricted with IP access lists. This applies to all requests to the Databricks SQL Statement Execution API. Furthermore, for executing statements, the authenticated user must have permission to the data objects (such as tables, views, and functions) that are used in each statement. This is enforced by existing access control mechanisms in Unity Catalog or by using table ACLs; see the Data governance guide. This also means that only the user who executes a statement can make fetch requests for the statement's results.

Databricks recommends the following security best practices whenever you use the Databricks SQL Statement Execution API along with the EXTERNAL_LINKS disposition to retrieve large data sets:

- Remove the Databricks authorization header for Azure storage requests
- Protect SAS URLs and SAS tokens

The EXTERNAL_LINKS disposition can be disabled upon request. To make this request, create a support case.

Remove the Databricks authorization header for Azure storage requests

All calls to the Databricks SQL Statement Execution API must include an Authorization header that contains Azure Databricks access credentials. Do not include this Authorization header whenever you download data from Azure storage. This header is not required and might unintentionally expose your Azure Databricks access credentials.

Protect SAS URLs and SAS tokens

Whenever you use the EXTERNAL_LINKS disposition, a short-lived SAS URL is generated, which the caller can use to download the results directly from Azure storage by using TLS. As a short-lived SAS token is embedded within this SAS URL, you should protect both the SAS URL and SAS token.


