Extract words from a text file

You are currently reading a thread in /g/ - Technology

Thread replies: 66
Thread images: 19
File: 56452980_l.jpg (114 KB, 660x506)
Say I have a big .txt file with information structured like this:

"name":"Hitler","Stallman likes to eat dead skin off of his feet";"name":" "Dick_Butt","I like turles";"name":"Ahmed_Muhammed":"I made a clock!";"name":"anon","Install gentoo.";"name":"12345","435464523"

I want to make a script that looks for the word "clock", and if it finds it, saves the name to the left of it to a text file (in this case "Ahmed_Muhammed"), then continues looking for more instances of "clock" until it reaches the end of the .txt file.

Can this be done with a batch file?

Pic unrelated
>>
>>51546770
Your inconsistent use of delimiters indicates that you are a huge faggot.
>>
File: regular_expressions.png (105 KB, 600x607)
>>
Somehow
>>
I won't tell
>>
>>51546802
Yeah? Is it possible to save the output to "output.txt" with a batch file?

>>51546827
You sure? Anon above says it's possible.
>>
cat input | tr ';' '\n' | grep -o "\"name\":\"\([[:alpha:]]\|_\)*\":\".*clock.*\"" | cut -d '"' -f 4 > output


Very ad hoc.
>>
>>51546865
> yeah
No
> you sure
It's a secret
>>
>>51546845
>Your inconsistent use of delimiters indicates that you are a huge faggot.

I didn't make the data set, the above is just an example based on the same structure.

Also, follow-up question: I have roughly 1TB of this data. Will a batch file be able to handle that amount, do you think?
>>
>>51546894
Have you try ed zsh?
>>
File: 1448538443888.png (1 MB, 1864x1792)
I guess you could explode on : and then strip "" from your entries in the array or something.
>Go fuck yourself though.
>>
File: 200.gif (4 MB, 369x200)
>>51546914
>Have you try ed zsh?

I'm on windows, so I can't do shell scripting.
>>
>>51546950
Cygwin/MSYS/Babun. Yes you can.
>>
>>51546950
http://gnuwin32.sourceforge.net/packages/coreutils.htm

Also, just get a VM. Or powershell. And get some motivation. Nobody will do your work for you, hopefully.
>>
>>51546950
Hahahahaahahahahahqhahahahahahahahahahahahahqhqhqhahahqhahahahahahahahahahahahahhahahahahahaahhahahahahahahahahahahahhahahahahahahahahahahahahahahahahahahahahahahahahahahahhahahahahahahaahhahahahahahahahahahahahahahahahhahahahahahahahahahahhahahahahahahahhahaahhahahhahahahahahahahaahahahhahahaahahhaahahahahahhahahahaahahahhahahahahahhahahahahhahahahahahahhahahahahahahahahahahhahahahahaha
>>
File: 1339433390601.jpg (56 KB, 374x292)
>>51546881
>unix

I'm on Windows 7.
>>
Curried troll thread
>>
>>51546950
Yes you can.
>>
File: 1364413468880.jpg (142 KB, 800x776)
>>51546992
Trust me, I'm trying my best.
>>
>>51547049
Windows batch scripts are useless compared to bash + coreutils.
>>
>>51546992
You're not tricking me into installing Linux again, /g/. But I've installed coreutils. What now? How do I start this thing?
>>
>>51547020
Install gentoo.

(use mingw32 or something)
>>
>>51547114
>use mingw32 or something

Actually, I have Eclipse installed. Maybe I could do it in Java? I took a Java class back in the day, with a little help I think I could pull it off.
>>
>>51547153
>java
R u meming me again
>>
From my understanding text files, and any file for that matter, is 100% immutable and all you can do is turn it on and off
>>
>>51547020
Just boot a livecd with ntfs-3g installed, mount your partition and use the script faggot
>>
>>51547108
Install Perl or Python.
>>
File: 1446159616006.png (362 KB, 700x700)
>>51546942
I don't know what that means. I was thinking doing it like this:

>Detect word "clock"
>read six characters to the left of the word
>check if the string == "name"
>If it does not, jump one character to the left, read six characters, check if the string == "name"
>Do this until the string == "name" is found
>Jump two characters to the right
>Read 1 character
>Check if the character == "
>If it's not, check if the character == one of n numbers of ascii characters
>When the character == an ascii character, save this character to a file then move one character to the right, check if the character == ", if it isn't do the previous step
>Do this until the character == "
>Continue searching for more "clock" after the "

I know it's probably a shit way to do it, but in my head it seems like it could work??
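
The scan described above (find the word, walk left to the nearest "name" key, then read up to the closing quote) could be sketched in Python roughly like this. It is only a sketch: the function name names_mentioning and its parameters are made up for illustration, it assumes every record starts with the literal "name":" as in the example data, and it holds the whole text in memory, so it would not suit the 1TB case as written.

```python
def names_mentioning(text, word="clock", key='"name":"'):
    """Return the name that precedes each occurrence of `word` in `text`.

    Assumes (hypothetically) that every record begins with `key`,
    as in the example data from the first post.
    """
    names = []
    pos = text.find(word)
    while pos != -1:
        # Walk left: the last `key` before the hit marks the owning record.
        start = text.rfind(key, 0, pos)
        if start != -1:
            start += len(key)
            # Read right until the quote that closes the name.
            end = text.find('"', start)
            if end != -1:
                names.append(text[start:end])
        # Continue searching after this hit.
        pos = text.find(word, pos + len(word))
    return names
```

A record that mentions the word twice is reported twice, so the result may need deduplicating before it is saved.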
>>
File: le lebron face.jpg (139 KB, 600x599)
>>51547369
>all that shit
Bruh.
>>
File: 1446669132929.jpg (37 KB, 400x447)
>>51547387
Do you know a better way, anon? Because I don't, and I'm trying very hard here.
>>
>>51547369
>using set numbers for a variable

Damn son, I failed the shit out of intro to programing but even im not that dumb
>>
File: face of shitposting.gif (2 MB, 500x500)
>>51547425
Yeah
>>51547425

Also I'm sure you can just blindly cut on delimiters (unless they can be present if quoted/escaped, then you need to be smart about it). And I know nothing about Java but it probably can read a line at a time and do regular expressions.
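
Blindly cutting on delimiters might look something like the following in Python. This is a sketch under assumptions, not a finished tool: the function name names_by_splitting is invented here, records are assumed to be separated by ';', and the name is assumed to be the fourth '"'-delimited field (the same field the earlier cut -d '"' -f 4 picks). As noted, it breaks as soon as a ';' or '"' appears inside a comment.

```python
def names_by_splitting(text, word="clock"):
    """Split `text` into records on ';' and return the name field of
    every record that contains `word`.

    Assumes (hypothetically) that delimiters never occur inside the
    quoted values themselves.
    """
    names = []
    for record in text.split(";"):
        if word not in record:
            continue
        fields = record.split('"')
        # fields: ['', 'name', ':', '<the name>', ...]
        if len(fields) > 3:
            names.append(fields[3])
    return names
```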
>>
File: 1448562203025.jpg (115 KB, 960x719)
>>51547369
>>
>>51547468
Meant to quote >>51546881 m8.
>>
Here is some of the real data I'm working with. It's roughly 1.7 billion reddit comments formatted like this:

{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
{"distinguished":null,"id":"cnas8zw","archived":false,"author":"RedCoatsForever","score":3,"created_utc":"1420070400","downs":0,"body":"But Mill's career was way better. Bentham is like, the Joseph Smith to Mill's Brigham Young.","link_id":"t3_2qv6c6","name":"t1_cnas8zw","score_hidden":false,"controversiality":0,"subreddit_id":"t5_2s4gt","edited":false,"retrieved_on":1425124282,"ups":3,"author_flair_css_class":"on","gilded":0,"author_flair_text":"Ontario","subreddit":"CanadaPolitics","parent_id":"t1_cnas2b6"}
{"score_hidden":false,"link_id":"t3_2qxefp","name":"t1_cnas8zx","created_utc":"1420070400","downs":0,"body":"Mine uses a strait razor, and as much as i love the clippers i love the razor so much more. Then he follows it up with a warm towel. \nI think i might go get a hair cut this week.","distinguished":null,"id":"cnas8zx","archived":false,"author":"vhisic","score":1,"subreddit":"AdviceAnimals","parent_id":"t3_2qxefp","retrieved_on":1425124282,

I'm trying to extract the user name for posts that contain certain words I'm interested in.
>>
>>51546770
>I have roughly 1TB of this data,
You're gonna be busy a looooong time if you insist on doing this blindly in Windows.
>>
File: 1381194374384.jpg (56 KB, 597x519)
>>51547566
Then help me anon, you're my only hope!

I should mention that if I get this up and running, I intend to use it to PM anyone who has posted about a specific rare disease on reddit. It'll help a lot of people.
>>
>>51547613
Debian, man. Get moving.
>>
>>51547566
Also I have a server I can run it on 24/7, if it takes a year to complete that's fine. I just need to get it working.
>>
>>51547636
I have CentOS in a VM, can I use that? If so I'll boot it right up.
>>
>>51547656
yes or I think there is sed for windows. If there is:
cat file|tr ';' '\n'|sed '/clock/!d;s/^[^:]*:"\([^"]*\)".*/\1/'

would work
>>
>>51547656
You can, but it's gonna be slower than running it natively. It doesn't matter what distro you use; I would use Debian or Xubuntu for something like this.
>>
>>51547613
I posted this in another thread but this has enough to get you started.
https://www.youtube.com/watch?v=smbeKPDVs2I
The topic isn't what is important, it's the commands you want to take note of.
>>
>>51547701
I've tried to install The GnuWin port of Sed on windows, but I can't get it to work.

I'm booting into CentOS now.
>>
You could do it within a blink using R

But you would need to clean up that data properly first (i.e. ":" doesn't belong in CSV files)
>>
i could do that in 2 minutes

get out
>>
>>51547715
I'm running Windows on my server, that's why I was hoping for a Win solution. If a VM turns out to be too slow, I'll dualboot linux on it no problem.
>>
File: 1426774062844.jpg (33 KB, 530x444)
>>51547875
I wish I were as pro as you brah
>>
File: test.png (33 KB, 785x493)
>>51547701
I tried, what am I doing wrong here?
>>
>>51547978
oh that was for the file in the op, for
>>51547554
you'd need something different.
if all of the '{}''s are on different lines, just do
cat sample|sed '/body":.*searchterm[&"]*/!d;s/.*"author":"\([^"]*\).*/\1/'
otherwise add tr '{}' '\n\n'
>>
File: data.png (292 KB, 1893x922)
>>51548073
All the '{}''s are on different lines.

In the pic I posted, the same error comes regardless of the content of the text file.

Am I correct in assuming I should replace "sample" with the path to the file containing the data?

Pic related is how the data is stored.
>>
it's called json ya fag.

make a simple python script
>>
>>51548161
try putting single quotes around or a \ before all !
>>
>>51546770
>batch
fuck off
grep/regex should do what you want
>>
>>51548073
How do you know all that regex crap but not know that you can just do
sed s/example// file.txt
, rather than
cat file.txt | sed s/example//
>>
>>51547554
m8, that's JSON
>>
>>51547554
>Damn son, I failed the shit out of intro to programing but even im not that dumb
thats json buddy
just use python's json module and you're done
>>
File: error.png (179 KB, 1441x550)
>>51548205
I somehow managed to reference an empty data file. Now I get this error though (see pic).
>>
>>51548179
this
should be relatively simple
pseudo-code, don't take it literally
there's no way this'll work since I don't know the exact format of the file
import json
with open('your_file', 'rb') as infile:
    file_as_text = infile.read().decode('utf-8')
j = json.loads(file_as_text)
if 'clock in j':
    print("%s mentioned clock' % j['author']")
>>
>>51548425
>
if 'clock in j':

correction
if 'clock' in j['body']:
>>
>>51548425
>
print("%s mentioned clock' % j['author']")

another correction
print("%s mentioned clock" % j['author'])

either way, it doesn't matter since pseudo-code
the point was to show that this is ridiculously simple
>>
File: Yiw4qfu.jpg (391 KB, 1200x1200)
>>51548425
>>51548440
>>51548494

The files are a series of JSON blocks delimited by newlines (\n). The files themselves have no ending, but they open fine in UltraEdit.

If it's as simple as you say, can I pay you some bitcoin to slap a python script together for me?
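
For files laid out like that (one JSON object per line), a line-by-line pass might look roughly like this. It is a sketch, not a tested tool for the real dump: the function names authors_mentioning and scan_file are invented here, the "body" and "author" keys are taken from the sample records posted earlier, and lines that fail to parse (such as the truncated third sample record) are simply skipped. Reading one line at a time keeps memory use flat, which matters at 1TB.

```python
import json


def authors_mentioning(lines, word):
    """Yield the author of every comment whose body contains `word`.

    `lines` is any iterable of strings, one JSON object per line
    (an open file works, and is read lazily).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if not isinstance(comment, dict):
            continue
        body = comment.get("body") or ""
        author = comment.get("author")
        if author and word in body:
            yield author


def scan_file(in_path, out_path, word):
    """Stream `in_path` and write one matching author per line to `out_path`."""
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for author in authors_mentioning(src, word):
            dst.write(f"{author}\n")
```

Something like scan_file("comments.json", "output.txt", "clock") would then produce the list, where both file names are placeholders.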
>>
File: 1382803831078.png (22 KB, 300x188)
bump
>>
>reinventing a json parser
>not just import json and continue with life

And you guys make fun of python where you just
>import program
>>
>>51548939
How is this done exactly? Say I want to find all the users who have posted the word "orange" in the data set?
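
One way to answer that, sketched under the same assumptions as the sample data (one JSON object per line, with "body" and "author" keys): match "orange" as a whole word, ignoring case, and collect each user only once. The function name unique_authors is made up for illustration, and keeping every matching author in a set assumes the list of matches fits in memory.

```python
import json
import re


def unique_authors(lines, word):
    """Return a sorted list of authors whose comment body contains
    `word` as a whole word, case-insensitively."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    seen = set()
    for line in lines:
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank, truncated, or otherwise malformed line
        if not isinstance(comment, dict):
            continue
        body = comment.get("body") or ""
        author = comment.get("author")
        if author and pattern.search(body):
            seen.add(author)
    return sorted(seen)
```

Passing an open file object as lines reads it one line at a time. The whole-word match means "oranges" does not count as a hit, which may or may not be what is wanted.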
>>
>>51548997
Google it. Or just fuck off back to reddit already.