Spark SQL译文

佚名 7年前 (2019-04-24) 随笔 1946人围观抢沙发百度已收录

Starting Point: SparkSession
Spark中所有功能的入口点是SparkSession类。
要创建基本的SparkSession，只需使用SparkSession.builder:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
在Spark repo中的“examples / src / main / python / sql / basic.py”中找到完整的示例代码。
Spark 2.0中的SparkSession为Hive功能提供内置支持，包括使用HiveQL编写查询，访问Hive UDF以及从Hive表读取数据的功能。
要使用这些功能，您无需拥有现有的Hive设置
Creating DataFrames:
使用SparkSession，应用程序可以从现有的RDD，Hive表或Spark数据源创建DataFrame。
示例，以下内容基于JSON文件的内容创建DataFrame
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age| name|
# +----+-------+
# |null|Michael|
# | 30| Andy|
# | 19| Justin|
# +----+-------+

无类型数据集操作（又名DataFrame操作）

SRE实战互联网时代守护先锋，助力企业售后服务体系运筹帷幄！一键直达领取阿里云限量特价优惠。在Python中，可以通过属性（df.age）或索引（df ['age']）访问DataFrame的列。 ·虽然前者便于交互式数据探索，但强烈建议用户使用后一种形式，这是未来的证明，不会破坏也是DataFrame类属性的列名

# spark, df are from the previous example
# Print the schema in a tree format df.printSchema() # root # |-- age: long (nullable = true) # |-- name: string (nullable = true) # Select only the "name" column df.select("name").show() # +-------+ # | name| # +-------+ # |Michael| # | Andy| # | Justin| # +-------+ # Select everybody, but increment the age by 1 df.select(df['name'], df['age'] + 1).show() # +-------+---------+ # | name|(age + 1)| # +-------+---------+ # |Michael| null| # | Andy| 31| # | Justin| 20| # +-------+---------+ # Select people older than 21 df.filter(df['age'] > 21).show() # +---+----+ # |age|name| # +---+----+ # | 30|Andy| # +---+----+ # Count people by age df.groupBy("age").count().show() # +----+-----+ # | age|count| # +----+-----+ # | 19| 1| # |null| 1| # | 30| 1| # +----+-----+