bornahokie/sentiment-analysis-ai (Cole Walton)
Compare revisions
Source: 6398ab1d3e08e2586ff8c8ee2ff8147b5c7d3ac2
Target: 4c932b3825e0b4e82c64915ca8226e6bb14b6972
Commits on Source (2)
fdc41e00 · Fixed the query system of WET files · Cole Walton authored 1 year ago
6398ab1d · idk why git is mad at me · Cole Walton authored 1 year ago
Showing 2 changed files with 32 additions and 0 deletions
4664.code-workspace: 12 additions, 0 deletions
webscrape.py: 20 additions, 0 deletions
4664.code-workspace (0 → 100644) @ 6398ab1d
{
	"folders": [
		{
			"path": ".."
		},
		{
			"name": "sentiment-analysis-ai",
			"path": "."
		}
	],
	"settings": {}
}
\ No newline at end of file
webscrape.py (0 → 100644) @ 6398ab1d
import sys

import requests
from warcio import ArchiveIterator

# Force UTF-8 output so non-ASCII page text prints cleanly on any console.
sys.stdout.reconfigure(encoding='utf-8')

## Could add a web-scraping step that queries all of the WARC URLs for each of the different news and media URLs
wet_url = 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/segments/1685224649986.95/wet/CC-MAIN-20230604125132-20230604155132-00544.warc.wet.gz'

# Stream the archive instead of downloading it in full.
r = requests.get(wet_url, stream=True)
records = ArchiveIterator(r.raw)

# The first record is expected to be the warcinfo metadata record.
record = next(records)
assert record.rec_type == 'warcinfo'
text = record.content_stream().read()
print(text.decode('utf-8', errors='ignore'))

# Print every short English record. The for loop already advances the
# iterator, so calling next() inside it would skip every other record.
# Content-Length is compared as an integer, not as a string.
for record in records:
    headers = record.rec_headers
    length = int(headers.get_header('Content-Length') or 0)
    if length < 5000 and headers.get_header('WARC-Identified-Content-Language') == 'eng':
        text = record.content_stream().read()
        print(text.decode('utf-8', errors='ignore'))
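The record filter at the end of webscrape.py depends on how header values are compared. WARC header values arrive as strings, and string comparison is lexicographic, so `'10000' < '5000'` evaluates to true. Below is a minimal sketch of the same filter as a standalone predicate, assuming the headers are handed over as a plain dict. The name `is_short_english` and the `max_length` parameter are hypothetical and not part of the committed file.

```python
def is_short_english(headers, max_length=5000):
    """Return True for records under max_length bytes that are tagged 'eng'.

    `headers` is assumed to be a plain dict mapping WARC header names to
    string values. This helper is illustrative only.
    """
    try:
        # Convert to int so the comparison is numeric, not lexicographic.
        length = int(headers.get('Content-Length'))
    except (TypeError, ValueError):
        # Missing or malformed length: skip the record rather than guess.
        return False
    language = headers.get('WARC-Identified-Content-Language')
    return length < max_length and language == 'eng'
```

Keeping the predicate separate from the download loop makes it possible to check the filtering rule without fetching an archive.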
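The comment in webscrape.py floats the idea of querying every WET URL rather than one hard-coded file. One possible starting point, sketched here under assumptions: Common Crawl publishes a gzip-compressed listing of relative WET paths for each crawl (assumed to live at `crawl-data/CC-MAIN-2023-23/wet.paths.gz` under the same host), and each line of that listing can be prefixed with the host to form a download URL. The function name `wet_urls_from_listing` and the `BASE_URL` constant are hypothetical.

```python
import gzip

# Assumed host prefix, taken from the wet_url used in webscrape.py.
BASE_URL = 'https://data.commoncrawl.org/'


def wet_urls_from_listing(gz_bytes, base_url=BASE_URL):
    """Turn a gzip-compressed listing of relative WET paths into full URLs.

    `gz_bytes` is the raw content of the listing file, one path per line.
    Blank lines are ignored.
    """
    lines = gzip.decompress(gz_bytes).decode('utf-8').splitlines()
    return [base_url + line.strip() for line in lines if line.strip()]
```

The bytes could come from something like `requests.get(listing_url).content`. Each returned URL could then be streamed through `ArchiveIterator` the same way the single `wet_url` is in the file above.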