11 Scraping Websites | 151 |
12 Working with HTTP APIs | 160 |
13 Fork-Join Parallelism with Futures | 168 |
14 Simple Web and API Servers | 177 |
15 Querying SQL Databases | 185 |
The third part of this book covers using Scala in a world of servers and clients, systems and services. We will explore using Scala both as a client and as a server, exchanging HTML and JSON over HTTP or Websockets. This part builds towards two capstone projects: a parallel web crawler and an interactive chat website, each representing common use cases you are likely to encounter using Scala in a networked, distributed environment.
11.1 Scraping Wikipedia | 152 |
11.2 MDN Web Documentation | 153 |
11.3 Scraping MDN | 154 |
11.4 Putting it Together | 155 |
11.5 Conclusion | 155 |
@ val doc = Jsoup.connect("http://en.wikipedia.org/").get()

@ doc.title()
res2: String = "Wikipedia, the free encyclopedia"

@ val headlines = doc.select("#mp-itn b a")
headlines: select.Elements =
<a href="/wiki/Bek_Air_Flight_2100" title="Bek Air Flight 2100">Bek Air Flight 2100</a>
<a href="/wiki/Assassination_of_..." title="Assassination of ...">2018 killing</a>
<a href="/wiki/State_of_the_..." title="State of the...">upholds a ruling</a>
...
</> 11.1.scala
Snippet 11.1: scraping Wikipedia's front-page links using the Jsoup third-party library in the Scala REPL
The user-facing interface of most networked systems is a website. In fact, often that is the only interface! This chapter will walk you through using the Jsoup library from Scala to scrape human-readable HTML pages, unlocking the ability to extract data from websites that do not provide access via an API.
Apart from scraping third-party websites, Jsoup is also a useful tool for testing the HTML user interfaces that we will encounter in Chapter 14: Simple Web and API Servers. This chapter is also a chance to get more familiar with using Java libraries from Scala, a necessary skill for taking advantage of the broad and deep Java ecosystem. Lastly, it is an exercise in non-trivial interactive development in the Scala REPL, which is a great place to prototype and try out pieces of code that are not yet ready to be saved in a script or project.
12.1 The Task: Github Issue Migrator | 161 |
12.2 Creating Issues and Comments | 162 |
12.3 Fetching Issues and Comments | 163 |
12.4 Migrating Issues and Comments | 164 |
12.5 Conclusion | 165 |
@ requests.post(
    "https://api.github.com/repos/lihaoyi/test/issues",
    data = ujson.Obj("title" -> "hello"),
    headers = Map("Authorization" -> s"token $token")
  )
res1: requests.Response = Response(
  "https://api.github.com/repos/lihaoyi/test/issues",
  201,
  "Created",
...
</> 12.1.scala
Snippet 12.1: interacting with Github's HTTP API from the Scala REPL
HTTP APIs have become the standard for any organization that wants to let external developers integrate with their systems. This chapter will walk you through how to access HTTP APIs in Scala, building up to a simple use case: migrating Github issues from one repository to another using Github's public API.
We will build upon techniques learned in this chapter in Chapter 13: Fork-Join Parallelism with Futures, where we will be writing a parallel web crawler using the Wikipedia JSON API to walk the graph of articles and the links between them.
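To make the shape of such an API call concrete, here is a minimal sketch that builds (but does not send) a request like the one in Snippet 12.1. It deliberately swaps in the JDK's built-in `java.net.http` classes (Java 11 or later) in place of the `requests` and `ujson` libraries used in this chapter, so it runs without third-party dependencies. The names `IssueRequestSketch`, `issueJson`, and `createIssueRequest` are hypothetical, and the JSON is assembled by hand purely for illustration.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object IssueRequestSketch {
  // Hand-rolled JSON body for a new issue. Only backslashes and double quotes
  // are escaped; a real JSON library is the safer choice outside of a sketch.
  def issueJson(title: String): String = {
    val escaped = title.replace("\\", "\\\\").replace("\"", "\\\"")
    s"""{"title": "$escaped"}"""
  }

  // Builds, but does not send, a POST request to create an issue.
  // `repo` has the form "owner/name"; `token` is a placeholder access token.
  def createIssueRequest(repo: String, title: String, token: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"https://api.github.com/repos/$repo/issues"))
      .header("Authorization", s"token $token")
      .header("Content-Type", "application/json")
      .POST(BodyPublishers.ofString(issueJson(title)))
      .build()
}
```

The resulting request could then be sent with `java.net.http.HttpClient.newHttpClient().send(request, java.net.http.HttpResponse.BodyHandlers.ofString())`. Keeping construction separate from sending makes the URL, headers, and body easy to inspect before any network traffic happens.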
13.1 Parallel Computation using Futures | 169 |
13.2 N-Ways Parallelism | 170 |
13.3 Parallel Web Crawling | 171 |
13.4 Asynchronous Futures | 172 |
13.5 Asynchronous Web Crawling | 173 |
13.6 Conclusion | 174 |
def fetchAllLinksParallel(startTitle: String, depth: Int): Set[String] = {
  var seen = Set(startTitle)
  var current = Set(startTitle)
  for (i <- Range(0, depth)) {
    val futures = for (title <- current) yield Future{ fetchLinks(title) }
    val nextTitleLists = futures.map(Await.result(_, Inf))
    current = nextTitleLists.flatten.filter(!seen.contains(_))
    seen = seen ++ current
  }
  seen
}
</> 13.1.scala
Snippet 13.1: a simple parallel web-crawler implemented using Scala Futures
The Scala programming language comes with a Futures API. Futures make parallel and asynchronous programming much easier to handle than working with traditional techniques of threads, locks, and callbacks.
This chapter dives into Scala's Futures: how to use them, how they work, and how you can use them to parallelize data processing workflows. It culminates in using Futures together with the techniques we learned in Chapter 12: Working with HTTP APIs to write a high-performance concurrent web crawler in a straightforward and intuitive way.
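Snippet 13.1 relies on imports and a `fetchLinks` helper that are not shown in this overview. The sketch below is one way to make it self-contained, under clearly stated assumptions: `fetchLinks` is replaced by a stub that reads from a small, made-up in-memory link graph instead of calling the Wikipedia API, and the enclosing `CrawlSketch` object and `linkGraph` value are hypothetical names introduced only for this illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration.Inf

object CrawlSketch {
  // Hypothetical link graph standing in for Wikipedia; this is not real data.
  val linkGraph: Map[String, Seq[String]] = Map(
    "A" -> Seq("B", "C"),
    "B" -> Seq("D"),
    "C" -> Seq("D", "A"),
    "D" -> Seq("E")
  )

  // Stub for fetchLinks: looks up outgoing links locally instead of over HTTP.
  def fetchLinks(title: String): Seq[String] = linkGraph.getOrElse(title, Nil)

  // Breadth-first crawl: each level's pages are fetched in parallel, and the
  // loop waits for the whole level to finish before moving to the next one.
  def fetchAllLinksParallel(startTitle: String, depth: Int): Set[String] = {
    var seen = Set(startTitle)
    var current = Set(startTitle)
    for (i <- Range(0, depth)) {
      // Start one Future per title so the fetches can run concurrently.
      val futures = for (title <- current) yield Future { fetchLinks(title) }
      // Block until every fetch in this level has completed.
      val nextTitleLists = futures.map(Await.result(_, Inf))
      // Only titles not seen before are explored at the next level.
      current = nextTitleLists.flatten.filter(!seen.contains(_))
      seen = seen ++ current
    }
    seen
  }
}
```

With this stub graph, `CrawlSketch.fetchAllLinksParallel("A", 2)` returns `Set("A", "B", "C", "D")`: the second level reaches `D`, while `E` would need a third level. Because the stub does no real I/O, the parallelism brings no speedup here; the benefit appears when each `fetchLinks` call waits on a network request.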
14.1 A Minimal Webserver | 178 |
14.2 Serving HTML | 179 |
14.3 Forms and Dynamic Data | 180 |
14.4 Dynamic Page Updates via API Requests | 181 |
14.5 Real-time Updates with Websockets | 182 |
14.6 Conclusion | 183 |
object MinimalApplication extends cask.MainRoutes {
  @cask.get("/")
  def hello() = {
    "Hello World!"
  }

  @cask.post("/do-thing")
  def doThing(request: cask.Request) = {
    request.text().reverse
  }

  initialize()
}
</> 14.1.scala
Snippet 14.1: a minimal Scala web application, using the Cask web framework
Web and API servers are the backbone of internet systems. While in the last few chapters we learned to access these systems from a client's perspective, this chapter will teach you how to provide such APIs and websites from the server's perspective. We will walk through a complete example of building a simple real-time chat website serving both HTML web pages and JSON API endpoints. We will revisit this website in Chapter 15: Querying SQL Databases, where we will convert its simple in-memory datastore into a proper SQL database.
15.1 Setting up Quill and PostgreSQL | 186 |
15.2 Mapping Tables to Case Classes | 187 |
15.3 Querying and Updating Data | 188 |
15.4 Transactions | 189 |
15.5 A Database-Backed Chat Website | 190 |
15.6 Conclusion | 191 |
@ ctx.run(query[City].filter(_.population > 5000000).filter(_.countryCode == "CHN"))
res16: List[City] = List(
  City(1890, "Shanghai", "CHN", "Shanghai", 9696300),
  City(1891, "Peking", "CHN", "Peking", 7472000),
  City(1892, "Chongqing", "CHN", "Chongqing", 6351600),
  City(1893, "Tianjin", "CHN", "Tianjin", 5286800)
)
</> 15.1.scala
Snippet 15.1: using the Quill database query library from the Scala REPL
Most modern systems are backed by relational databases. This chapter will walk you through the basics of using a relational database from Scala, using the Quill query library. We will work through small self-contained examples of how to store and query data within a Postgres database, and then convert the interactive chat website we implemented in Chapter 14: Simple Web and API Servers to use a Postgres database for data storage.
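The query in Snippet 15.1 reads much like a pair of `filter` calls on an ordinary Scala collection. As a rough in-memory analogue (not Quill itself, which translates such queries into SQL rather than filtering in memory), the sketch below applies the same two predicates to a plain `List`. The field names `id`, `name`, and `district` are assumptions, since the snippet only names `population` and `countryCode`; the last two rows are invented so that each filter has something to exclude.

```scala
object CityFilterSketch {
  // Field names other than `countryCode` and `population` are assumptions.
  case class City(id: Int, name: String, countryCode: String, district: String, population: Int)

  val cities: List[City] = List(
    // Rows copied from the output shown in Snippet 15.1.
    City(1890, "Shanghai", "CHN", "Shanghai", 9696300),
    City(1891, "Peking", "CHN", "Peking", 7472000),
    City(1892, "Chongqing", "CHN", "Chongqing", 6351600),
    City(1893, "Tianjin", "CHN", "Tianjin", 5286800),
    // Hypothetical rows, made up so that each predicate rejects something.
    City(-1, "Smalltown", "CHN", "Nowhere", 1000),
    City(-2, "Bigcity", "XXX", "Nowhere", 6000000)
  )

  // Same two predicates as the Quill query, applied to an in-memory list.
  def largeChineseCities(all: List[City]): List[City] =
    all.filter(_.population > 5000000).filter(_.countryCode == "CHN")
}
```

Calling `CityFilterSketch.largeChineseCities(CityFilterSketch.cities)` keeps the four rows taken from the snippet and drops the two invented ones. The practical difference with a database-backed query is where the filtering runs: in the database, so only matching rows travel over the connection.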