Wrangling Court Data on a National Level
SLIDE 1

Wrangling Court Data on a National Level

A presentation by Mike Lissner, creator of CourtListener.com and Juriscraper

SLIDE 2

The agenda

  • Who am I?
  • What is CourtListener?
  • What is Juriscraper?
  • How does it work?
  • What does it do?
  • How can you contribute?
  • What's the future hold?
SLIDE 3

Me

  • Mike Lissner
  • Not:
      • A lawyer
      • A computer scientist
  • Am:
      • A grad of the UC Berkeley School of Information
      • An employee of a search company you may know
      • An open source/open access enthusiast
      • Blogging at http://michaeljaylissner.com
SLIDE 4

CourtListener Background

  • Started in 2010
  • Aggregates data and provides alerts
  • Powerful search engine
  • Data dumps
  • Citation linking (see Rowyn's presentation!)
  • Free. Free. Free.
  • Demo
SLIDE 5

Juriscraper

  • Our main topic du jour
  • A newer project used live on CourtListener
  • A simple open source scraper that anybody can use

SLIDE 6

Juriscraper's Features

  • Extensibility
  • Solid, modern code
  • Character detection and normalization
  • Simple installation
  • Harmonization
  • Sophisticated title casing
  • Sanity checking and hard failures
SLIDE 7

Extensibility

  • Supports:
      • Varied geographies (countries, states, federal)
      • Languages
      • Media types (video, oral arguments, text)
  • Currently has scrapers for:
      • Federal appeals courts
      • Some states
      • Some special jurisdictions
      • Some back scrapers
SLIDE 8

Modern Code

  • Requires: DRY, OO, PEP8
  • Uses:
      • Python 2.7
      • lxml and XPath
      • Requests
      • chardet
SLIDE 9

Character Encodings

  • Detects the encoding declaration in XML or HTML pages
  • If that's missing, sniffs the encoding based on the binary data
  • Normalizes everything to UTF-8
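The detect-then-sniff flow can be sketched in Python. Juriscraper itself uses the chardet library for the sniffing step; to keep this sketch dependency-free, it falls back to a simple utf-8/latin-1 guess, so the function name and fallback logic are illustrative only:

```python
import re

def detect_and_normalize(raw: bytes) -> str:
    """Return page text normalized to a Unicode string.

    Illustrative sketch: the real code sniffs with chardet when no
    declaration is present.
    """
    # 1. Look for an explicit charset declaration in the page header.
    m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:1024])
    if m:
        encoding = m.group(1).decode("ascii")
    else:
        # 2. No declaration: sniff from the binary data (naive fallback).
        try:
            raw.decode("utf-8")
            encoding = "utf-8"
        except UnicodeDecodeError:
            encoding = "latin-1"
    # 3. Normalize everything to Unicode, replacing undecodable bytes.
    return raw.decode(encoding, errors="replace")
```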
SLIDE 10

Harmonization

  • Words like "et al", "appellant", and "executor" get removed
  • All forms of "USA" get normalized (U.S.A., U.S., United States, US, etc.)
  • All forms of "vs" get normalized
  • Text gets titlecased if needed (much harder than it seems!)
  • Junk punctuation gets removed/replaced
  • Dates get converted to Python objects, and results are guaranteed to be in reverse chronological order
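A few of the harmonization rules above can be sketched with regular expressions. These patterns are illustrative simplifications, not Juriscraper's actual rules:

```python
import re

def harmonize(name: str) -> str:
    """Illustrative sketch of case-name harmonization."""
    # Normalize every form of "vs" (vs, vs., versus, v) to "v."
    name = re.sub(r"\b(?:vs\.?|versus|v\.?)\s", "v. ", name, flags=re.I)
    # Normalize forms of "USA" to "United States"
    name = re.sub(
        r"\b(?:U\.?\s?S\.?\s?A\.?|United States of America)(?=[\s.,;]|$)",
        "United States",
        name,
    )
    # Remove party descriptors like "et al", "appellant", "executor"
    name = re.sub(
        r",?\s*\b(?:et al\.?|appellants?|executor)\b", "", name, flags=re.I
    )
    # Collapse any leftover whitespace
    return re.sub(r"\s+", " ", name).strip()
```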

SLIDE 11

Sanity Checking and Hard Failures

  • Court websites change frequently
  • If our metadata is bad, we should fail completely and loudly
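A hard failure might look like the following sketch. Juriscraper raises an InsanityException for cases like this; the specific check shown here is hypothetical:

```python
class InsanityException(Exception):
    """Raised when scraped metadata fails a sanity check."""

def check_sanity(case_names, case_dates):
    # Hypothetical check, in the spirit of "fail completely and loudly":
    # every scraped case must have both a name and a date.
    if len(case_names) != len(case_dates):
        raise InsanityException(
            "Got %s case names but %s dates; the court site probably changed."
            % (len(case_names), len(case_dates))
        )
```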

SLIDE 12

Integrating Juriscraper

aka

“All about the Caller”

  • You have to build a "caller"
  • You'll want:
      • Duplicate detection
      • Minimal impact on court websites
      • Mimetype detection
      • OCR
      • PDF "decryption"
SLIDE 13

Duplicate Detection

  • Test whether the site has changed, using a hash
  • If so, extract the metadata from the page using Juriscraper
  • Iterate over the items, downloading their text or binary content
  • If a hash of the text or binary is new, save the item and proceed to the next
  • Else, dup_count++
  • If proceeding, check the date of the next item:
      • If it is prior to the dup we found, terminate
      • Else, check a hash of the next item
  • If dup_count == 5, terminate
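The caller-side loop described above might be sketched as follows. The item structure, function name, and hash choice are assumptions for illustration, not Juriscraper's API; items arrive newest-first, per the reverse-chronological guarantee:

```python
import hashlib

DUP_LIMIT = 5  # stop after five duplicates: we're almost surely caught up

def scrape_site(items, seen_hashes):
    """Sketch of a caller's duplicate-detection loop.

    `items` is a newest-first list of dicts with "binary" and "date"
    keys; `seen_hashes` holds hashes of previously saved documents.
    """
    saved, dup_count = [], 0
    last_dup_date = None
    for item in items:
        digest = hashlib.sha1(item["binary"]).hexdigest()
        if digest in seen_hashes:
            dup_count += 1
            last_dup_date = item["date"]
            if dup_count == DUP_LIMIT:
                break
        else:
            # Anything dated before a known duplicate must also be old news.
            if last_dup_date is not None and item["date"] < last_dup_date:
                break
            seen_hashes.add(digest)
            saved.append(item)
    return saved
```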
SLIDE 14

Impact Minimization

  • Methods:
  • Reasonable duplicate detection algorithms
  • User-agent set to “juriscraper”
  • Free sharing of data via our API
SLIDE 15

Mimetypes, OCR and PDFs

  • Mimetypes can be detected via "magic numbers"
  • Text can then be extracted
  • If no text, use OCR
  • If text is garbled, try "decrypting" it
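Magic-number detection works by comparing a file's leading bytes against known signatures (in practice a library such as python-magic wraps libmagic for this; the sketch below hard-codes a few common signatures for illustration):

```python
# Leading-byte signatures ("magic numbers") for a few common formats.
MAGIC = {
    b"%PDF": "application/pdf",
    b"PK\x03\x04": "application/zip",        # also .docx and friends
    b"\xd0\xcf\x11\xe0": "application/msword",  # legacy OLE .doc
}

def sniff_mimetype(data: bytes, default: str = "text/plain") -> str:
    """Guess a mimetype from a document's first bytes."""
    for magic, mime in MAGIC.items():
        if data.startswith(magic):
            return mime
    return default
```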

This would be awful, but...

SLIDE 16

We built a sample caller.

Two, actually.

SLIDE 17

Getting involved

  • No more siloed scrapers!
  • All code is open source (BSD license)
  • Installation is simple (five minutes using pip)
  • We built some custom tools to make development easier
  • Looking for:
      • More users
      • More developers
SLIDE 18

Why this is important

  • Scaling is vital.
  • More callers means:
  • More jurisdictions
  • Faster response times
  • Improved code
  • A unified court scraper (user-agent)
SLIDE 19

Juriscraper's Future

  • Better alerts for downed scrapers
  • Court-level rate throttling
  • HTML tidying
  • API Refactoring
  • More courts!
  • More backscrapers
  • More unit tests
SLIDE 20

Juriscraper Demo/walkthrough

SLIDE 21

Thank you.

  • http://courtlistener.com/
  • https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/
  • https://bitbucket.org/mlissner/juriscraper/
  • http://michaeljaylissner.com/